A recent investigation of some missing scenes discovered by @madhuksridhar revealed that on several days during November we hit the same RDS scaling issue that occurred during our historical processing (https://github.com/NASA-IMPACT/hls_development/issues/232). The problem occurs when https://github.com/NASA-IMPACT/hls-orchestration/blob/dev/lambda_functions/sentinel_ac_logger.py times out writing to the Aurora Serverless API endpoint, so when an S30 job fails (normally due to a Spot market interruption) the failure is not properly logged. If the failed job is not logged, the error will not be handled by the S30 error reprocessing function https://github.com/NASA-IMPACT/hls-orchestration/blob/dev/lambda_functions/process_sentinel_errors.py.
This is happening because there is no index on the granule field and the table contains a large number of "successful" rows, so looking up the record to update is slow and the Aurora Serverless API call times out.
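For context, without an index on the granule column each update has to scan the whole table, which is what pushes the Data API call past its timeout as the table grows. A minimal sketch of adding such an index through the same Data API, assuming a PostgreSQL-compatible cluster and that the column is actually named granule (the ARNs, database name, and index name below are placeholders, not the deployed values):

```python
import boto3

# Placeholder identifiers; substitute the real cluster/secret ARNs and database name.
CLUSTER_ARN = "arn:aws:rds:us-west-2:123456789012:cluster:hls-orchestration"
SECRET_ARN = "arn:aws:secretsmanager:us-west-2:123456789012:secret:hls-db-creds"
DATABASE = "hls"

client = boto3.client("rds-data")

# Assumes the logger looks rows up by the granule column; an index avoids a full table scan.
client.execute_statement(
    resourceArn=CLUSTER_ARN,
    secretArn=SECRET_ARN,
    database=DATABASE,
    sql="CREATE INDEX IF NOT EXISTS sentinel_log_granule_idx ON sentinel_log (granule);",
)
```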
To alleviate this in the short term, we should remove "successful" rows from the sentinel_log, landsat_ac_log, and landsat_mgrs_log tables on a monthly basis and rebuild the appropriate indexes.
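A sketch of what that monthly maintenance could look like when run through the Data API, assuming success is recorded in a status column on each of the three tables and that the cluster is PostgreSQL-compatible (the WHERE predicate, ARNs, and database name are assumptions to confirm against the real schema before running anything):

```python
import boto3

CLUSTER_ARN = "arn:aws:rds:us-west-2:123456789012:cluster:hls-orchestration"  # placeholder
SECRET_ARN = "arn:aws:secretsmanager:us-west-2:123456789012:secret:hls-db-creds"  # placeholder
DATABASE = "hls"  # placeholder

client = boto3.client("rds-data")


def run(sql: str) -> None:
    """Execute a single statement against the Aurora Serverless Data API."""
    client.execute_statement(
        resourceArn=CLUSTER_ARN,
        secretArn=SECRET_ARN,
        database=DATABASE,
        sql=sql,
    )


for table in ("sentinel_log", "landsat_ac_log", "landsat_mgrs_log"):
    # The WHERE clause is an assumption about how success is recorded in the schema.
    run(f"DELETE FROM {table} WHERE status = 'SUCCEEDED';")
    # Rebuild the table's indexes after the bulk delete so lookups stay fast.
    run(f"REINDEX TABLE {table};")
```

If the tables have grown very large, the DELETE may itself need to be batched to stay under the Data API's per-statement timeout.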
I've created a recurring calendar event so we can meet and do this on the first Tuesday of every month.