Handle SPOT market interruptions differently than job failures. #212

@sharkinsspatial

Description

Currently we set a retry limit for Batch processing jobs and increment a run_count field in our log database to track attempts for Sentinel, Landsat AC, and Landsat MGRS processing. This limit is intended to prevent infinite retries of a job whose granule has an underlying issue (invalid or corrupted data) that prevents it from being processed.

Currently, the run_count field is incremented on each update to the relevant log table, so we don't distinguish between a failure caused by an underlying granule issue and a job that failed because its SPOT instance was interrupted. Normally this is not a problem, since we use a relatively high retry limit of 5 and failures are only retried 1 or 2 times per 24-hour period, depending on configuration. However, we have seen cases with several successive SPOT interruptions where run_count exceeded the retry limit, so jobs that should have been retried were not. These failures can be identified with a query similar to:

```sql
SELECT run_count, count(*)
FROM landsat_ac_log
WHERE jobinfo IS NOT NULL
  AND jobinfo->'Container'->>'ExitCode' IS NULL
GROUP BY run_count;
```
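The same condition can be checked in application code. Below is a minimal sketch, assuming `jobinfo` is the deserialized JSON column from the log table; the function name is hypothetical:

```python
from typing import Optional


def is_spot_interruption(jobinfo: Optional[dict]) -> bool:
    """Return True when a failed job looks like a SPOT interruption.

    Hypothetical helper mirroring the SQL condition above: an
    interrupted job has jobinfo recorded but no container ExitCode.
    """
    if not jobinfo:
        return False
    return jobinfo.get("Container", {}).get("ExitCode") is None
```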

The run_count increment logic in the logging functions should be altered so that the field is not incremented for jobs that failed due to SPOT interruptions.
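A minimal sketch of what the guarded increment could look like. The function name is hypothetical, and the SPOT test assumes (per the query above) that an interrupted job leaves no container exit code in jobinfo:

```python
from typing import Optional


def next_run_count(run_count: int, jobinfo: Optional[dict]) -> int:
    """Compute the run_count to store when logging a failed job.

    Hypothetical sketch of the proposed rule: only failures that
    reached a container exit code count against the retry limit;
    SPOT interruptions (no ExitCode in jobinfo) are not counted.
    """
    exit_code = (jobinfo or {}).get("Container", {}).get("ExitCode")
    if exit_code is None:
        # SPOT interruption: retry without consuming an attempt.
        return run_count
    return run_count + 1
```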
