A recent investigation of some missing scenes discovered by @madhuksridhar revealed that on several days during November we hit the same RDS scaling issue that occurred during our historical processing (https://github.com/NASA-IMPACT/hls_development/issues/232). The problem occurs when https://github.com/NASA-IMPACT/hls-orchestration/blob/dev/lambda_functions/sentinel_ac_logger.py times out writing to the Aurora Serverless API endpoint, so when an S30 job fails (normally due to a Spot market interruption) the failure is not properly logged. If the failed job is not logged, the error will not be handled by the S30 error reprocessing function https://github.com/NASA-IMPACT/hls-orchestration/blob/dev/lambda_functions/process_sentinel_errors.py.
This is happening because there is no index on the granule field and the table contains a large number of "successful" rows, so looking up the record to update is slow and the Aurora Serverless API call times out.
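For context, without an index on the granule column each update has to scan the whole table, which is what pushes the Data API call past its timeout as the table grows. A minimal sketch of adding such an index through the same Data API, assuming a PostgreSQL-compatible cluster and that the column is actually named granule (the ARNs, database name, and index name below are placeholders, not the deployed values):

```python
import boto3

# Placeholder identifiers; substitute the real cluster/secret ARNs and database name.
CLUSTER_ARN = "arn:aws:rds:us-west-2:123456789012:cluster:hls-orchestration"
SECRET_ARN = "arn:aws:secretsmanager:us-west-2:123456789012:secret:hls-db-creds"
DATABASE = "hls"

client = boto3.client("rds-data")

# Assumes the logger looks rows up by the granule column; an index avoids a full table scan.
client.execute_statement(
    resourceArn=CLUSTER_ARN,
    secretArn=SECRET_ARN,
    database=DATABASE,
    sql="CREATE INDEX IF NOT EXISTS sentinel_log_granule_idx ON sentinel_log (granule);",
)
```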
To alleviate this in the short term, we should remove "successful" rows from the sentinel_log, landsat_ac_log, and landsat_mgrs_log tables on a monthly basis and rebuild the appropriate indexes.
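A sketch of what that monthly maintenance could look like when run through the Data API, assuming success is recorded in a status column on each of the three tables and that the cluster is PostgreSQL-compatible (the WHERE predicate, ARNs, and database name are assumptions to confirm against the real schema before running anything):

```python
import boto3

CLUSTER_ARN = "arn:aws:rds:us-west-2:123456789012:cluster:hls-orchestration"  # placeholder
SECRET_ARN = "arn:aws:secretsmanager:us-west-2:123456789012:secret:hls-db-creds"  # placeholder
DATABASE = "hls"  # placeholder

client = boto3.client("rds-data")


def run(sql: str) -> None:
    """Execute a single statement against the Aurora Serverless Data API."""
    client.execute_statement(
        resourceArn=CLUSTER_ARN,
        secretArn=SECRET_ARN,
        database=DATABASE,
        sql=sql,
    )


for table in ("sentinel_log", "landsat_ac_log", "landsat_mgrs_log"):
    # The WHERE clause is an assumption about how success is recorded in the schema.
    run(f"DELETE FROM {table} WHERE status = 'SUCCEEDED';")
    # Rebuild the table's indexes after the bulk delete so lookups stay fast.
    run(f"REINDEX TABLE {table};")
```

If the tables have grown very large, the DELETE may itself need to be batched to stay under the Data API's per-statement timeout.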
I've created a recurring calendar event so we can meet and do this on the first Tuesday of every month.