-
Notifications
You must be signed in to change notification settings - Fork 253
feat: new daft.File
datatype
#4959
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
src/daft-file/src/python.rs
Outdated
fn _from_path(path: String) -> PyResult<Self> { | ||
if !path.starts_with("file://") { | ||
return Err(PyNotImplementedError::new_err( | ||
"remote protocols not yet supported!", | ||
)); | ||
} | ||
let path = path.replace("file://", ""); | ||
|
||
let file = File::open(Path::new(&path)) | ||
.map_err(|e| PyIOError::new_err(format!("Failed to open file {}: {}", path, e)))?; | ||
Ok(Self { | ||
path: Some(path), | ||
cursor: Some(FileCursor::File(file)), | ||
position: 0, | ||
}) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So i wanted to keep the scope of this PR super small, and as such, this has a super naive file reader implementation. I want to support:
def my_func(file: daft.File):
with file.open(io_config) as f:
but I think the abstractions defined in this PR will make it relatively easy to add more protocols without breaking the user interface.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4959 +/- ##
==========================================
+ Coverage 76.29% 76.89% +0.59%
==========================================
Files 918 922 +4
Lines 128703 127327 -1376
==========================================
- Hits 98195 97906 -289
+ Misses 30508 29421 -1087
🚀 New features to boost your workflow:
|
@samster25 @kevinzwang @rchowell, this PR should be ready for review now! 📁 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This PR introduces a new daft.File
datatype that enables working with file contents as first-class values in Daft DataFrames. The implementation provides two variants: PathFile
for filesystem references and MemoryFile
for in-memory binary data, both accessible through a unified file-like interface.
The core architecture centers around a DaftFile
enum with Reference(String)
and Data(Vec<u8>)
variants, implemented as a logical type backed by a StructType
with discriminant and data fields. This follows Daft's established pattern for complex datatypes like ImageType
and TensorType
. The implementation spans multiple layers:
- Core types:
FileType
logical datatype,FileArray
for collections, andLiteral::File
for literals - Python interface:
File
base class withPathFile
andMemoryFile
subclasses providing standard file operations (read, seek, tell, close) and context manager support - Function integration:
file()
function (exposed asF.to_file()
) converts string paths or binary data to File objects - Array operations: Full integration with Daft's array system including growable arrays, take operations, casting, and sorting
The feature enables users to create DataFrames containing file objects and apply UDFs that operate directly on file contents, as demonstrated in the PR example where files are loaded from glob patterns and processed to count lines. The implementation includes comprehensive test coverage and follows existing codebase patterns for type registration, serialization, and Python bindings.
Confidence score: 1/5
- This PR contains multiple critical issues that will cause immediate failures including build errors from incorrect dependency paths, runtime panics from incomplete implementations, and unsafe type casting
- Score reflects numerous TODO placeholders in critical code paths, unsafe unwrap() usage throughout, and fundamental compilation issues that prevent the feature from working
- Pay close attention to
Cargo.toml
dependency path errors, panic-inducing TODO implementations in display/serialization code, and unsafe type conversions insrc/daft-file/src/python.rs
46 files reviewed, 19 comments
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
daft.File
datatypedaft.File
datatype
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Changes Made
example usage:
TODOS:
PyDaftFile
implementation to use our native IO file readers instead of naive approachdaft.File
andF.to_file
Related Issues
Checklist
docs/mkdocs.yml
navigation