Description
Currently we set a retry limit for batch processing jobs and increment a run_count field in our log database to track attempts for Sentinel, Landsat AC and Landsat MGRS processing. This limit is intended to prevent infinitely retrying a job where there is an underlying issue with the granule that prevents it from being processed (invalid or corrupted data).
Currently, the run_count field is incremented for each update to the relevant log table. We don't distinguish between a failure caused by an underlying granule issue and a job that failed because its SPOT instance was interrupted. Normally this is not an issue, since we use a relatively high retry limit of 5 and failures are only retried 1 or 2 times per 24-hour period, depending on configuration. However, we have seen cases where several successive SPOT interruptions caused the run_count to exceed the retry limit, so jobs which should be retried again are not. These failures can be identified by using a query similar to:
SELECT run_count, count(*) FROM landsat_ac_log WHERE jobinfo IS NOT NULL AND jobinfo->'Container'->>'ExitCode' IS NULL GROUP BY run_count;
The run_count increment logic in the logging functions should be altered so that the field is not incremented for jobs which have failed due to SPOT interruptions.
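As a rough illustration of the proposed change, the decision could be sketched as below. This is only a sketch under assumptions: the function names is_spot_interruption and next_run_count are hypothetical, and it assumes that a SPOT interruption can be recognised the same way as in the query above (job info is present but the container reported no ExitCode). The real logging functions may need a different or stricter signal.

```python
from typing import Any, Mapping, Optional


def is_spot_interruption(jobinfo: Optional[Mapping[str, Any]]) -> bool:
    """Guess whether a failed job was interrupted rather than genuinely failed.

    Assumption: mirrors the query in this issue, i.e. jobinfo is present
    but jobinfo["Container"]["ExitCode"] is missing or null.
    """
    if jobinfo is None:
        return False
    container = jobinfo.get("Container") or {}
    return container.get("ExitCode") is None


def next_run_count(current: int, jobinfo: Optional[Mapping[str, Any]]) -> int:
    """Return the run_count to store after a failed attempt.

    Hypothetical helper: interrupted attempts do not count against the
    retry limit; every other failure increments the counter as before.
    """
    if is_spot_interruption(jobinfo):
        return current
    return current + 1
```

With this shape, a job whose container exited with a non-zero code still moves towards the retry limit, while an attempt with no exit code leaves run_count unchanged.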