[SPARK-52508][CORE] Fallback storage retries FileNotFoundExceptions #51200
+150
−8
What changes were proposed in this pull request?
Adds options to retry FileNotFoundExceptions when opening files migrated to the fallback storage.
STORAGE_DECOMMISSION_FALLBACK_STORAGE_REPLICATION_DELAY sets the allowed replication delay. The executor waits at most this long for the shuffle data file to appear on the fallback storage.
STORAGE_DECOMMISSION_FALLBACK_STORAGE_REPLICATION_WAIT sets the interval between re-attempts to look for the file.
Why are the changes needed?
When a distributed filesystem is used as the fallback storage for migrating shuffle data on executor decommissioning, executors that attempt to read the migrated data might not yet see the file written by the decommissioned executor. This is called replication delay.
Currently, executors give up instantly, even though they know the data have been successfully migrated to the fallback storage and will not be migrated any further from there. Having the executor wait for a defined time and reattempt to open the file avoids a fetch failure and a re-computation of the migrated shuffle data.
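The wait-and-reattempt behaviour described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this patch: the object name ReplicationRetry, the method openWithRetry, and the parameters maxDelayMs and waitMs are hypothetical stand-ins for the replication delay and the re-attempt interval, and the injectable clock and sleep functions are an assumption made to keep the example testable.

```scala
import java.io.FileNotFoundException
import scala.annotation.tailrec

// Hypothetical helper, not part of Spark: retries an "open" action while it
// throws FileNotFoundException, up to a maximum total delay.
object ReplicationRetry {
  /**
   * Evaluates `open` until it succeeds or `maxDelayMs` has elapsed,
   * sleeping at most `waitMs` between attempts. Once the deadline has
   * passed, the FileNotFoundException of the last attempt propagates.
   * With maxDelayMs = 0 the first failure propagates immediately.
   */
  def openWithRetry[T](
      maxDelayMs: Long,
      waitMs: Long,
      clock: () => Long = () => System.nanoTime() / 1000000L,
      sleep: Long => Unit = ms => Thread.sleep(ms))(open: => T): T = {
    val deadline = clock() + maxDelayMs

    @tailrec
    def attempt(): T = {
      // Only swallow the exception while there is still time left to wait.
      val result: Option[T] =
        try Some(open)
        catch { case _: FileNotFoundException if clock() < deadline => None }

      result match {
        case Some(value) => value
        case None =>
          // Never sleep past the deadline.
          sleep(math.min(waitMs, math.max(0L, deadline - clock())))
          attempt()
      }
    }

    attempt()
  }
}
```

A caller would wrap the actual file open in the second argument list, for example ReplicationRetry.openWithRetry(maxDelayMs = 5000L, waitMs = 100L) { openTheFile() }, where openTheFile is whatever call opens the migrated shuffle file. Only FileNotFoundException is retried, so other I/O errors still fail fast.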
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit test.
Was this patch authored or co-authored using generative AI tooling?
No