[SPARK-52507][CORE] Attempt to read missing block from fallback storage #51202
What changes were proposed in this pull request?
In the presence of a fallback storage, a ShuffleBlockFetcherIterator that sees a fetch failure can optimistically try to read the block from the fallback storage, since the block might have been migrated there from a decommissioned executor. If storage migration happens only to the fallback storage (#51201), this assumption is even more likely to hold.

Note: This optimistic attempt to find missing shuffle data on the fallback storage would collide with the replication delay handling in #51200.
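The fetch path described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual Spark implementation: the object name, helper functions, and data types are hypothetical stand-ins for the executor fetch and the fallback-storage read.

```scala
// Hypothetical sketch of the optimistic retry: a miss on the executor is
// only reported as a failure if the fallback storage also does not hold
// the (possibly migrated) block.
object FallbackFetchSketch {
  sealed trait FetchResult
  case class Success(data: Array[Byte]) extends FetchResult
  case class Failure(blockId: String) extends FetchResult

  // Stand-in for fetching a block from a (possibly decommissioned) executor.
  def fetchFromExecutor(blockId: String, liveBlocks: Set[String]): Option[Array[Byte]] =
    if (liveBlocks.contains(blockId)) Some(blockId.getBytes("UTF-8")) else None

  // Stand-in for reading a migrated block from the fallback storage.
  def readFromFallback(blockId: String, fallback: Map[String, Array[Byte]]): Option[Array[Byte]] =
    fallback.get(blockId)

  def fetch(blockId: String,
            liveBlocks: Set[String],
            fallback: Map[String, Array[Byte]]): FetchResult =
    fetchFromExecutor(blockId, liveBlocks)
      .orElse(readFromFallback(blockId, fallback)) // optimistic second attempt
      .map(Success(_))
      .getOrElse(Failure(blockId))
}
```

Only when both lookups miss does the failure propagate, which in the real iterator is what would otherwise trigger a FetchFailedException and a stage retry.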
Why are the changes needed?
In a Kubernetes environment, executors may be decommissioned. With a fallback storage configured, shuffle data is migrated to other executors or to the fallback storage. A task that starts while another executor is decommissioning might attempt to read blocks from that executor after it is gone, and the task does not know the new location of the migrated block. Since a fallback storage is configured, the task can optimistically try to read the block from there.
This avoids a stage retry, which is otherwise an expensive way to learn the new block location after a migration.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit test and manual testing in a Kubernetes setup.
Was this patch authored or co-authored using generative AI tooling?
No