Complete monthly, manual RDS maintenance. #301

@sharkinsspatial

Description

A recent investigation of missing scenes discovered by @madhuksridhar revealed that on several days during November we hit the same RDS scaling issue that occurred during our historical processing (https://github.com/NASA-IMPACT/hls_development/issues/232). The problem arises when https://github.com/NASA-IMPACT/hls-orchestration/blob/dev/lambda_functions/sentinel_ac_logger.py times out while writing to the Aurora Serverless API endpoint, so that when an S30 job fails (normally due to a Spot market interruption) the failure is not properly logged. If the failed job is not logged, the error is never handled by the S30 error reprocessing function, https://github.com/NASA-IMPACT/hls-orchestration/blob/dev/lambda_functions/process_sentinel_errors.py.

This issue occurs because the granule field is not indexed and the table contains a large number of "successful" rows, so the lookup of the record to update is slow enough that the Aurora Serverless API call times out.
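One way to address the root cause would be to add the missing index on the granule column. As a minimal sketch, the helper below builds the statement; the index name is hypothetical, and the statement would be executed against the cluster via the Aurora Serverless Data API or `psql`.

```python
def granule_index_sql(table: str) -> str:
    """Build a CREATE INDEX statement for a log table's granule column.

    The index name ("<table>_granule_idx") is a hypothetical convention;
    the "granule" column name comes from this issue's description.
    """
    return (
        f"CREATE INDEX IF NOT EXISTS {table}_granule_idx "
        f"ON {table} (granule);"
    )

print(granule_index_sql("sentinel_log"))
```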

To alleviate this in the short term, we should remove "successful" rows from the sentinel_log, landsat_ac_log, and landsat_mgrs_log tables on a monthly basis and rebuild the appropriate indexes.
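The monthly procedure could be scripted along these lines. This is a sketch, assuming a PostgreSQL-backed Aurora cluster; the `jobinfo ->> 'Status'` predicate is a hypothetical stand-in for however "successful" rows are actually flagged in these tables and would need to match the real schema.

```python
def cleanup_statements(
    tables=("sentinel_log", "landsat_ac_log", "landsat_mgrs_log"),
):
    """Build the monthly maintenance statements for each log table."""
    stmts = []
    for table in tables:
        # Hypothetical success predicate; replace with the column/value
        # that actually marks a row as "successful" in these tables.
        stmts.append(
            f"DELETE FROM {table} WHERE jobinfo ->> 'Status' = 'SUCCEEDED';"
        )
        # PostgreSQL REINDEX rebuilds every index on the table after the purge.
        stmts.append(f"REINDEX TABLE {table};")
    return stmts

for stmt in cleanup_statements():
    print(stmt)
```

Running the deletes before the `REINDEX` keeps the rebuilt indexes compact, which is the point of the monthly pass.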

I've created a recurring calendar event so we can meet and perform this maintenance on the first Tuesday of every month.
