Handle SPOT market interruptions differently than job failures. #212

@sharkinsspatial

Description

Currently we set a retry limit for Batch processing jobs and increment a run_count field in our log database to track attempts for Sentinel, Landsat AC, and Landsat MGRS processing. This limit is intended to prevent infinite retries of a job whose granule has an underlying issue (invalid or corrupted data) that prevents it from being processed.

Currently, the run_count field is incremented on each update to the relevant log table, so we don't distinguish between a failure caused by an underlying granule issue and a job that failed because its SPOT instance was interrupted. Normally this is not a problem, since we use a relatively high retry limit of 5 and failures are only retried 1 or 2 times per 24-hour period, depending on configuration. However, we have seen cases with several successive SPOT interruptions where run_count exceeded the retry limit, so jobs that should have been retried were not. These failures can be identified with a query similar to:

```sql
SELECT run_count, count(*)
FROM landsat_ac_log
WHERE jobinfo IS NOT NULL
  AND jobinfo->'Container'->>'ExitCode' IS NULL
GROUP BY run_count;
```
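The same condition can be checked in application code. Below is a minimal sketch, assuming `jobinfo` is the deserialized JSON column from the log table; the function name is hypothetical:

```python
from typing import Optional


def is_spot_interruption(jobinfo: Optional[dict]) -> bool:
    """Return True when a failed job looks like a SPOT interruption.

    Hypothetical helper mirroring the SQL condition above: an
    interrupted job has jobinfo recorded but no container ExitCode.
    """
    if not jobinfo:
        return False
    return jobinfo.get("Container", {}).get("ExitCode") is None
```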

The run_count increment logic in the logging functions should be altered so that the field is not incremented for jobs that failed due to SPOT interruptions.
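A minimal sketch of what the guarded increment could look like. The function name is hypothetical, and the SPOT test assumes (per the query above) that an interrupted job leaves no container exit code in jobinfo:

```python
from typing import Optional


def next_run_count(run_count: int, jobinfo: Optional[dict]) -> int:
    """Compute the run_count to store when logging a failed job.

    Hypothetical sketch of the proposed rule: only failures that
    reached a container exit code count against the retry limit;
    SPOT interruptions (no ExitCode in jobinfo) are not counted.
    """
    exit_code = (jobinfo or {}).get("Container", {}).get("ExitCode")
    if exit_code is None:
        # SPOT interruption: retry without consuming an attempt.
        return run_count
    return run_count + 1
```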
