[SPARK-52507][CORE] Attempt to read missing block from fallback storage #51202

Status: Open (2 commits, base: master)

EnricoMi
Contributor

What changes were proposed in this pull request?

When a fallback storage is configured and ShuffleBlockFetcherIterator sees a fetch failure, it can optimistically try to read the block from the fallback storage, because the block might have been migrated there from a decommissioned executor. If storage migration targets only the fallback storage (#51201), this assumption is even more likely to hold.

Note: This optimistic attempt to find the missing shuffle data on the fallback storage may be affected by the replication delay that is handled in #51200.

Why are the changes needed?

In a Kubernetes environment, executors may be decommissioned. With a fallback storage configured, shuffle data is migrated to other executors or to the fallback storage. A task that starts while another executor is decommissioning might try to read blocks from that executor after it is gone. The task does not know the new location of the migrated block. If a fallback storage is configured, the task can optimistically try to read the block from there.

This avoids a stage retry, which is an expensive way to obtain the block's current address after a migration.
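
The control flow described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the PR's Scala code: `FetchFailed`, `fetch_block`, `fetch_from_executor`, and `read_from_fallback` are hypothetical names, and the fallback storage is modelled as a plain lookup function that returns `None` when the block is absent.

```python
from typing import Callable, Optional


class FetchFailed(Exception):
    """Hypothetical error: the block could not be fetched from its last known executor."""


def fetch_block(
    block_id: str,
    fetch_from_executor: Callable[[str], bytes],
    read_from_fallback: Optional[Callable[[str], Optional[bytes]]] = None,
) -> bytes:
    """Fetch a shuffle block, optimistically trying the fallback storage on failure.

    Assumption: read_from_fallback is None when no fallback storage is
    configured, and returns None when the block is not stored there.
    """
    try:
        return fetch_from_executor(block_id)
    except FetchFailed:
        if read_from_fallback is None:
            # No fallback storage configured: surface the fetch failure,
            # which would lead to a stage retry.
            raise
        data = read_from_fallback(block_id)
        if data is None:
            # The block was not migrated to the fallback storage.
            raise
        return data


# Usage: the executor is gone, but the block was migrated to the fallback storage.
fallback_store = {"shuffle_0_1_2": b"migrated bytes"}


def decommissioned_executor(block_id: str) -> bytes:
    raise FetchFailed(block_id)


block = fetch_block("shuffle_0_1_2", decommissioned_executor, fallback_store.get)
```

If the block is missing from the fallback storage as well, the original fetch failure propagates unchanged, so the existing stage-retry path remains the last resort.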

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test and manual testing in a Kubernetes setup.

Was this patch authored or co-authored using generative AI tooling?

No

github-actions bot added the CORE label on Jun 17, 2025
dongjoon-hyun changed the title from "[SPARK-52507][K8S] Attempt to read missing block from fallback storage" to "[SPARK-52507][CORE] Attempt to read missing block from fallback storage" on Jun 17, 2025