-
Notifications
You must be signed in to change notification settings - Fork 253
fix: ignore NotFound error of the non-first list during iter dir #4891
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This PR fixes a race condition in the S3-like object store directory iteration functionality within Daft's I/O layer. The change modifies the iter_dir
method in src/daft-io/src/object_io.rs
to handle NotFound errors that can occur during paginated listing operations.
The core issue stems from S3's ListObjectsV2 API behavior. When dealing with large object stores that have experienced many delete operations, the API may return continuation tokens even when no more objects remain to be listed. This happens because S3's ListObjectsV2 was designed to prevent timeouts by returning partial results within time limits, but the presence of "delete tombstones" (markers for deleted objects) can cause subsequent requests with continuation tokens to return empty results and throw NotFound errors.
The fix wraps the continuation token-based listing calls in error handling logic that specifically catches NotFound errors during pagination and treats them as an indication that there are no more objects to list, rather than propagating them as failures. This allows the directory iteration to complete successfully by breaking out of the pagination loop when encountering these race conditions.
This change is part of Daft's object storage abstraction layer, which provides unified access to various cloud storage systems. The modification ensures that glob operations and other directory-based file discovery operations remain robust when working with S3-compatible storage systems that exhibit this specific API behavior.
PR Description Notes:
- Minor typo: "contine token" should be "continue token"
- Missing space in "listv1" and "listv2" (should be "list v1" and "list v2")
Confidence score: 4/5
- This is a targeted fix for a well-understood S3 API behavior issue with clear error handling logic.
- The change appropriately handles the specific race condition without affecting normal operation paths.
- The error handling logic is conservative and only ignores NotFound errors during continuation token requests, not initial requests.
1 file reviewed, 1 comment
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix! I have some thoughts here. I want to be careful that we're not just papering over some incorrect code we have at a lower level.
}, | ||
Err(err) => { | ||
if matches!(err, super::Error::NotFound { .. }) { | ||
continuation_token = None; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a little digging to see where the NotFound error originates from. It seems we're throwing it ourselves in S3LikeSource
's ls
method.
What strikes me as a little odd is that we ignore the possibility that the request could have a continuation token when the error is thrown. There's no reason that we can't have a series of empty responses with continuation tokens.
Here's another proposal. Let's handle all continuation tokens to ensure proper pagination. After all pagination is handled, if we didn't hit a result, we return a NotFound error. So something like:
diff --git a/src/daft-io/src/object_io.rs b/src/daft-io/src/object_io.rs
index 06ce45b7f..d879f246f 100644
--- a/src/daft-io/src/object_io.rs
+++ b/src/daft-io/src/object_io.rs
@@ -227,9 +227,11 @@ pub trait ObjectSource: Sync + Send {
io_stats: Option<IOStatsRef>,
) -> super::Result<BoxStream<super::Result<FileMetadata>>> {
let uri = uri.to_string();
+ let mut found_any = false;
let s = stream! {
let lsr = self.ls(&uri, posix, None, page_size, io_stats.clone()).await?;
for fm in lsr.files {
+ found_any = true;
yield Ok(fm);
}
@@ -238,9 +240,17 @@ pub trait ObjectSource: Sync + Send {
let lsr = self.ls(&uri, posix, continuation_token.as_deref(), page_size, io_stats.clone()).await?;
continuation_token.clone_from(&lsr.continuation_token);
for fm in lsr.files {
+ found_any = true;
yield Ok(fm);
}
}
+
+ if !found_any {
+ yield Err(super::Error::NotFound {
+ path: uri,
+ source: Box::new(std::io::Error::new(std::io::ErrorKind::NotFound, "Path not found")),
+ });
+ }
};
Ok(s.boxed())
}
diff --git a/src/daft-io/src/s3_like.rs b/src/daft-io/src/s3_like.rs
index 2b6aa15b7..376d38e7a 100644
--- a/src/daft-io/src/s3_like.rs
+++ b/src/daft-io/src/s3_like.rs
@@ -1276,7 +1269,7 @@ impl ObjectSource for S3LikeSource {
is.mark_list_requests(1);
}
- if lsr.files.is_empty() && key.contains(S3_DELIMITER) {
+ if lsr.files.is_empty() && lsr.continuation_token.is_none() && key.ends_with(S3_DELIMITER) {
let permit = self
.connection_pool_sema
.acquire()
@@ -1301,11 +1294,6 @@ impl ObjectSource for S3LikeSource {
}
let target_path = format!("{scheme}://{bucket}/{key}");
lsr.files.retain(|f| f.filepath == target_path);
-
- if lsr.files.is_empty() {
- // Isn't a file or a directory
- return Err(Error::NotFound { path: path.into() }.into());
- }
Ok(lsr)
} else {
Ok(lsr)
Would this resolve the issue you're seeing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @desmondcheongzx for reviewing this PR.
I think your proposal is a workable solution, but it depends on what's the semantic/protocol of ls
method of ObjectSource
since we lack of clear documentation right now, especially for the the case that the path is not exist.
From the current implementation of each object store, the semantic/protocol of ls
might be return NotFound error if the dir/file is not exist, so the caller to handle the type of Result
.
If we change the semantic/protocol of ls
method to return empty ListResult, all the object source implementation need to change the implementation logic and it's better to document the semantic/protocol of each method of ObjectSource
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@desmondcheongzx may i check how do you think that whether we change the semantic of ls
method to return empty ListResult instead of return NotFound error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stayrascal ideally I'd like to fix ls
's behaviour. But I think you're right that because of the lack of documentation and the different object sources, this is a larger task. And I don't want this to keep blocking you from getting the fix you need (I do apologize for the delay here).
How about this - tbh I think the current change is safe enough to merge as is. We can create another ticket to solve the deeper issue, but let's unblock you for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in the discussion above, this fix looks safe. I do think there's a deeper issue to fix, which we can track here (#4982), but let's not keep blocking this fix.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4891 +/- ##
==========================================
- Coverage 79.46% 76.71% -2.75%
==========================================
Files 896 918 +22
Lines 125473 127428 +1955
==========================================
- Hits 99702 97758 -1944
- Misses 25771 29670 +3899
🚀 New features to boost your workflow:
|
Changes Made
Ignore the NotFound error during iter dir if got empty response after the first list operation.
Related Issues
The Not Found error was thrown during the second list request with next continue token from the previous list response, but got empty response, and then throw Not Found error, and then pop up the error to downstream, cause the glob process failed.
The reason why got empty response of the second list request with a continue token from prev response is that the listV2 of S3-like object store is trying to solve the timeout problem of listing behavior comparing to listv1.
Assume we are trying to list 1000 keys among abundant objects, especially if we did much delete operations before listing which might lead to delete holes problem via delete tombstone that will impact the list performance.
next continue token
andis_truncated=true
flag indicate the response is truncate and then client should continue to list the remaining objects, but listv2 doesn't ensure the existence of remaining objects because the previous scan operation of kv store is cut down, so the later list request might get empty response.Checklist
docs/mkdocs.yml
navigation