[SPARK-52507][CORE] Attempt to read missing block from fallback storage #51202
What changes were proposed in this pull request?
In the presence of a fallback storage, a ShuffleBlockFetcherIterator that sees a fetch failure can optimistically try to read the block from the fallback storage, since the block might have been migrated there from a decommissioned executor. If storage migration happens only to the fallback storage (#51201), this assumption is even more likely to hold.

Note: This optimistic attempt to find missing shuffle data on the fallback storage would collide with the replication delay handling in #51200.
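The fetch path described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual Spark implementation: the object name, helper functions, and data types are hypothetical stand-ins for the executor fetch and the fallback-storage read.

```scala
// Hypothetical sketch of the optimistic retry: a miss on the executor is
// only reported as a failure if the fallback storage also does not hold
// the (possibly migrated) block.
object FallbackFetchSketch {
  sealed trait FetchResult
  case class Success(data: Array[Byte]) extends FetchResult
  case class Failure(blockId: String) extends FetchResult

  // Stand-in for fetching a block from a (possibly decommissioned) executor.
  def fetchFromExecutor(blockId: String, liveBlocks: Set[String]): Option[Array[Byte]] =
    if (liveBlocks.contains(blockId)) Some(blockId.getBytes("UTF-8")) else None

  // Stand-in for reading a migrated block from the fallback storage.
  def readFromFallback(blockId: String, fallback: Map[String, Array[Byte]]): Option[Array[Byte]] =
    fallback.get(blockId)

  def fetch(blockId: String,
            liveBlocks: Set[String],
            fallback: Map[String, Array[Byte]]): FetchResult =
    fetchFromExecutor(blockId, liveBlocks)
      .orElse(readFromFallback(blockId, fallback)) // optimistic second attempt
      .map(Success(_))
      .getOrElse(Failure(blockId))
}
```

Only when both lookups miss does the failure propagate, which in the real iterator is what would otherwise trigger a FetchFailedException and a stage retry.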
Why are the changes needed?
In a Kubernetes environment, executors may be decommissioned. With a fallback storage configured, shuffle data is migrated to other executors or to the fallback storage. A task that starts while another executor is decommissioning might attempt to read blocks from that executor after it is gone, and the task does not know the new location of the migrated block. Since a fallback storage is configured, the task can optimistically try to read the block from there.
This avoids a stage retry, which is otherwise an expensive way to learn the new block location after a migration.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit test and manual testing in a Kubernetes setup.
Was this patch authored or co-authored using generative AI tooling?
No