Skip to content

fix: ignore NotFound error of the non-first list during iter dir #4891

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Aug 15, 2025

Conversation

stayrascal
Copy link
Contributor

@stayrascal stayrascal commented Aug 1, 2025

Changes Made

Ignore the NotFound error during iter dir if got empty response after the first list operation.

Related Issues

image As the above picture shows, globed the parent folder got a `Not found` error, but globed the sub folder succeed.

The Not Found error was thrown during the second list request with next continue token from the previous list response, but got empty response, and then throw Not Found error, and then pop up the error to downstream, cause the glob process failed.

The reason why got empty response of the second list request with a continue token from prev response is that the listV2 of S3-like object store is trying to solve the timeout problem of listing behavior comparing to listv1.
Assume we are trying to list 1000 keys among abundant objects, especially if we did much delete operations before listing which might lead to delete holes problem via delete tombstone that will impact the list performance.

  • The listV1 might be hanged and waiting for 1000 objects return in this situation and then the list http request will failed since timeout.
  • the listV2 will try to retrieve objects matched the request condition as much as possible within timeout, which means the response objects might less than 1000, the response contains the next continue token and is_truncated=true flag indicate the response is truncate and then client should continue to list the remaining objects, but listv2 doesn't ensure the existence of remaining objects because the previous scan operation of kv store is cut down, so the later list request might get empty response.

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

@github-actions github-actions bot added the fix label Aug 1, 2025
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR fixes a race condition in the S3-like object store directory iteration functionality within Daft's I/O layer. The change modifies the iter_dir method in src/daft-io/src/object_io.rs to handle NotFound errors that can occur during paginated listing operations.

The core issue stems from S3's ListObjectsV2 API behavior. When dealing with large object stores that have experienced many delete operations, the API may return continuation tokens even when no more objects remain to be listed. This happens because S3's ListObjectsV2 was designed to prevent timeouts by returning partial results within time limits, but the presence of "delete tombstones" (markers for deleted objects) can cause subsequent requests with continuation tokens to return empty results and throw NotFound errors.

The fix wraps the continuation token-based listing calls in error handling logic that specifically catches NotFound errors during pagination and treats them as an indication that there are no more objects to list, rather than propagating them as failures. This allows the directory iteration to complete successfully by breaking out of the pagination loop when encountering these race conditions.

This change is part of Daft's object storage abstraction layer, which provides unified access to various cloud storage systems. The modification ensures that glob operations and other directory-based file discovery operations remain robust when working with S3-compatible storage systems that exhibit this specific API behavior.

PR Description Notes:

  • Minor typo: "contine token" should be "continue token"
  • Missing space in "listv1" and "listv2" (should be "list v1" and "list v2")

Confidence score: 4/5

  • This is a targeted fix for a well-understood S3 API behavior issue with clear error handling logic.
  • The change appropriately handles the specific race condition without affecting normal operation paths.
  • The error handling logic is conservative and only ignores NotFound errors during continuation token requests, not initial requests.

1 file reviewed, 1 comment

Edit Code Review Bot Settings | Greptile

stayrascal and others added 2 commits August 1, 2025 21:16
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@desmondcheongzx desmondcheongzx self-requested a review August 4, 2025 16:31
Copy link
Contributor

@desmondcheongzx desmondcheongzx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! I have some thoughts here. I want to be careful that we're not just papering over some incorrect code we have at a lower level.

},
Err(err) => {
if matches!(err, super::Error::NotFound { .. }) {
continuation_token = None;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a little digging to see where the NotFound error originates from. It seems we're throwing it ourselves in S3LikeSource's ls method.

What strikes me as a little odd is that we ignore the possibility that the request could have a continuation token when the error is thrown. There's no reason that we can't have a series of empty responses with continuation tokens.

Here's another proposal. Let's handle all continuation tokens to ensure proper pagination. After all pagination is handled, if we didn't hit a result, we return a NotFound error. So something like:

diff --git a/src/daft-io/src/object_io.rs b/src/daft-io/src/object_io.rs
index 06ce45b7f..d879f246f 100644
--- a/src/daft-io/src/object_io.rs
+++ b/src/daft-io/src/object_io.rs
@@ -227,9 +227,11 @@ pub trait ObjectSource: Sync + Send {
         io_stats: Option<IOStatsRef>,
     ) -> super::Result<BoxStream<super::Result<FileMetadata>>> {
         let uri = uri.to_string();
+        let mut found_any = false;
         let s = stream! {
             let lsr = self.ls(&uri, posix, None, page_size, io_stats.clone()).await?;
             for fm in lsr.files {
+                found_any = true;
                 yield Ok(fm);
             }
 
@@ -238,9 +240,17 @@ pub trait ObjectSource: Sync + Send {
                 let lsr = self.ls(&uri, posix, continuation_token.as_deref(), page_size, io_stats.clone()).await?;
                 continuation_token.clone_from(&lsr.continuation_token);
                 for fm in lsr.files {
+                    found_any = true;
                     yield Ok(fm);
                 }
             }
+
+            if !found_any {
+                yield Err(super::Error::NotFound {
+                    path: uri,
+                    source: Box::new(std::io::Error::new(std::io::ErrorKind::NotFound, "Path not found")),
+                });
+            }
         };
         Ok(s.boxed())
     }
diff --git a/src/daft-io/src/s3_like.rs b/src/daft-io/src/s3_like.rs
index 2b6aa15b7..376d38e7a 100644
--- a/src/daft-io/src/s3_like.rs
+++ b/src/daft-io/src/s3_like.rs
@@ -1276,7 +1269,7 @@ impl ObjectSource for S3LikeSource {
                 is.mark_list_requests(1);
             }
 
-            if lsr.files.is_empty() && key.contains(S3_DELIMITER) {
+            if lsr.files.is_empty() && lsr.continuation_token.is_none() && key.ends_with(S3_DELIMITER) {
                 let permit = self
                     .connection_pool_sema
                     .acquire()
@@ -1301,11 +1294,6 @@ impl ObjectSource for S3LikeSource {
                 }
                 let target_path = format!("{scheme}://{bucket}/{key}");
                 lsr.files.retain(|f| f.filepath == target_path);
-
-                if lsr.files.is_empty() {
-                    // Isn't a file or a directory
-                    return Err(Error::NotFound { path: path.into() }.into());
-                }
                 Ok(lsr)
             } else {
                 Ok(lsr)

Would this resolve the issue you're seeing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @desmondcheongzx for reviewing this PR.

I think your proposal is a workable solution, but it depends on what's the semantic/protocol of ls method of ObjectSource since we lack of clear documentation right now, especially for the the case that the path is not exist.

From the current implementation of each object store, the semantic/protocol of ls might be return NotFound error if the dir/file is not exist, so the caller to handle the type of Result.

If we change the semantic/protocol of ls method to return empty ListResult, all the object source implementation need to change the implementation logic and it's better to document the semantic/protocol of each method of ObjectSource

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@desmondcheongzx may i check how do you think that whether we change the semantic of ls method to return empty ListResult instead of return NotFound error?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stayrascal ideally I'd like to fix ls's behaviour. But I think you're right that because of the lack of documentation and the different object sources, this is a larger task. And I don't want this to keep blocking you from getting the fix you need (I do apologize for the delay here).

How about this - tbh I think the current change is safe enough to merge as is. We can create another ticket to solve the deeper issue, but let's unblock you for now.

Copy link
Contributor

@desmondcheongzx desmondcheongzx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in the discussion above, this fix looks safe. I do think there's a deeper issue to fix, which we can track here (#4982), but let's not keep blocking this fix.

@desmondcheongzx desmondcheongzx enabled auto-merge (squash) August 15, 2025 06:52
@desmondcheongzx desmondcheongzx merged commit fe0d4bf into Eventual-Inc:main Aug 15, 2025
94 of 96 checks passed
Copy link

codecov bot commented Aug 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.71%. Comparing base (105580b) to head (8362bd6).
⚠️ Report is 59 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4891      +/-   ##
==========================================
- Coverage   79.46%   76.71%   -2.75%     
==========================================
  Files         896      918      +22     
  Lines      125473   127428    +1955     
==========================================
- Hits        99702    97758    -1944     
- Misses      25771    29670    +3899     
Files with missing lines Coverage Δ
src/daft-io/src/object_io.rs 76.06% <100.00%> (+3.25%) ⬆️

... and 259 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants