[SPARK-52508][CORE] Fallback storage retries FileNotFoundExceptions #51200

Open
wants to merge 1 commit into
base: master

EnricoMi
Contributor

What changes were proposed in this pull request?

Adds options to retry FileNotFoundExceptions when opening files migrated to the fallback storage.

  • STORAGE_DECOMMISSION_FALLBACK_STORAGE_REPLICATION_DELAY sets the allowed replication delay.
    The executor waits at most this long for the shuffle data file to appear on the fallback storage.
  • STORAGE_DECOMMISSION_FALLBACK_STORAGE_REPLICATION_WAIT sets the interval between attempts to find the file.
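
The interplay of the two options can be illustrated with a minimal sketch. This is not the code of this patch (Spark core is written in Scala); it is a Python illustration that assumes the delay acts as an overall deadline and the wait as the pause between attempts. The names `open_with_retry`, `replication_delay_s` and `replication_wait_s` are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def open_with_retry(
    open_file: Callable[[], T],
    replication_delay_s: float,
    replication_wait_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call open_file() until it succeeds or the replication delay has elapsed.

    Hypothetical helper: replication_delay_s stands in for the allowed
    replication delay, replication_wait_s for the interval between attempts.
    """
    deadline = time.monotonic() + replication_delay_s
    while True:
        try:
            return open_file()
        except FileNotFoundError:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                # The allowed delay is used up: surface the original error.
                raise
            # Never sleep past the deadline.
            sleep(min(replication_wait_s, remaining))
```

With a delay of zero the helper makes a single attempt and re-raises immediately, which corresponds to the give-up-instantly behaviour described under "Why are the changes needed?".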

Why are the changes needed?

When a distributed filesystem is used as the fallback storage for migrating shuffle data on executor decommissioning, executors that attempt to read the migrated data might not yet see the file written by the decommissioned executor. This is called replication delay.

Currently, executors give up immediately, even though they know the data has been migrated successfully to the fallback storage, from where it is not migrated any further. Having the executor wait for a bounded time and retry opening the file avoids a fetch failure and a recomputation of the migrated shuffle data.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Jun 17, 2025
@EnricoMi EnricoMi changed the title from "[SPARK-52508][K8S] FallbackStorage retries FileNotFoundExceptions" to "[SPARK-52508][K8S] Fallback storage retries FileNotFoundExceptions" Jun 17, 2025
@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-52508][K8S] Fallback storage retries FileNotFoundExceptions" to "[SPARK-52508][CORE] Fallback storage retries FileNotFoundExceptions" Jun 17, 2025