Skip to content

feat: new daft.File datatype #4959

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 38 commits into from
Aug 15, 2025
Merged

feat: new daft.File datatype #4959

merged 38 commits into from
Aug 15, 2025

Conversation

universalmind303
Copy link
Contributor

@universalmind303 universalmind303 commented Aug 12, 2025

Changes Made

example usage:

# %%

import daft
import daft.functions as F

df = daft.from_glob_path("~Daft/src/**/*.rs")
df = df.select(F.to_file(df["path"]).alias("data"))

@daft.func
def count_lines(file: daft.File) -> int:
    with file as f:
        return len(f.readlines())



df = df.select(count_lines(df["data"]).alias("n_lines"), df["data"]).sort("n_lines", desc=True)
df.show()

TODOS:

  • improve the PyDaftFile implementation to use our native IO file readers instead of naive approach
  • remove remaining todos.
  • add tests
  • add documentation for daft.File and F.to_file

Related Issues

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

@github-actions github-actions bot added the feat label Aug 12, 2025
Comment on lines 27 to 42
fn _from_path(path: String) -> PyResult<Self> {
if !path.starts_with("file://") {
return Err(PyNotImplementedError::new_err(
"remote protocols not yet supported!",
));
}
let path = path.replace("file://", "");

let file = File::open(Path::new(&path))
.map_err(|e| PyIOError::new_err(format!("Failed to open file {}: {}", path, e)))?;
Ok(Self {
path: Some(path),
cursor: Some(FileCursor::File(file)),
position: 0,
})
}
Copy link
Contributor Author

@universalmind303 universalmind303 Aug 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So i wanted to keep the scope of this PR super small, and as such, this has a super naive file reader implementation. I want to support:

def my_func(file: daft.File):
  with file.open(io_config) as f:

but I think the abstractions defined in this PR will make it relatively easy to add more protocols without breaking the user interface.

Copy link

codecov bot commented Aug 12, 2025

Codecov Report

❌ Patch coverage is 65.42923% with 149 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.89%. Comparing base (072858e) to head (a2b09b1).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-core/src/series/from_lit.rs 0.00% 24 Missing ⚠️
daft/file.py 62.90% 23 Missing ⚠️
src/common/file/src/lib.rs 20.00% 20 Missing ⚠️
src/daft-file/src/python.rs 85.71% 17 Missing ⚠️
src/daft-core/src/array/ops/repr.rs 0.00% 12 Missing ⚠️
src/daft-dsl/src/python.rs 0.00% 10 Missing ⚠️
src/daft-schema/src/dtype.rs 30.76% 9 Missing ⚠️
src/daft-core/src/array/ops/cast.rs 0.00% 6 Missing ⚠️
src/daft-core/src/lit/mod.rs 0.00% 5 Missing ⚠️
src/daft-functions/src/lib.rs 85.18% 4 Missing ⚠️
... and 10 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4959      +/-   ##
==========================================
+ Coverage   76.29%   76.89%   +0.59%     
==========================================
  Files         918      922       +4     
  Lines      128703   127327    -1376     
==========================================
- Hits        98195    97906     -289     
+ Misses      30508    29421    -1087     
Files with missing lines Coverage Δ
daft/functions/__init__.py 100.00% <100.00%> (ø)
daft/functions/misc.py 100.00% <100.00%> (ø)
...c/daft-core/src/array/growable/logical_growable.rs 90.62% <ø> (+33.86%) ⬆️
src/daft-core/src/array/growable/mod.rs 100.00% <ø> (ø)
src/daft-core/src/array/ops/file.rs 100.00% <100.00%> (ø)
src/daft-core/src/array/ops/from_arrow.rs 85.27% <ø> (ø)
src/daft-core/src/array/ops/get.rs 82.05% <100.00%> (+1.28%) ⬆️
src/daft-core/src/array/ops/take.rs 99.41% <ø> (ø)
src/daft-core/src/datatypes/logical.rs 80.61% <ø> (ø)
src/daft-core/src/datatypes/mod.rs 40.00% <ø> (ø)
... and 25 more

... and 33 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@universalmind303 universalmind303 marked this pull request as ready for review August 13, 2025 18:45
@universalmind303
Copy link
Contributor Author

@samster25 @kevinzwang @rchowell, this PR should be ready for review now! 📁 🚀

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR introduces a new daft.File datatype that enables working with file contents as first-class values in Daft DataFrames. The implementation provides two variants: PathFile for filesystem references and MemoryFile for in-memory binary data, both accessible through a unified file-like interface.

The core architecture centers around a DaftFile enum with Reference(String) and Data(Vec<u8>) variants, implemented as a logical type backed by a StructType with discriminant and data fields. This follows Daft's established pattern for complex datatypes like ImageType and TensorType. The implementation spans multiple layers:

  • Core types: FileType logical datatype, FileArray for collections, and Literal::File for literals
  • Python interface: File base class with PathFile and MemoryFile subclasses providing standard file operations (read, seek, tell, close) and context manager support
  • Function integration: file() function (exposed as F.to_file()) converts string paths or binary data to File objects
  • Array operations: Full integration with Daft's array system including growable arrays, take operations, casting, and sorting

The feature enables users to create DataFrames containing file objects and apply UDFs that operate directly on file contents, as demonstrated in the PR example where files are loaded from glob patterns and processed to count lines. The implementation includes comprehensive test coverage and follows existing codebase patterns for type registration, serialization, and Python bindings.

Confidence score: 1/5

  • This PR contains multiple critical issues that will cause immediate failures including build errors from incorrect dependency paths, runtime panics from incomplete implementations, and unsafe type casting
  • Score reflects numerous TODO placeholders in critical code paths, unsafe unwrap() usage throughout, and fundamental compilation issues that prevent the feature from working
  • Pay close attention to Cargo.toml dependency path errors, panic-inducing TODO implementations in display/serialization code, and unsafe type conversions in src/daft-file/src/python.rs

46 files reviewed, 19 comments

Edit Code Review Bot Settings | Greptile

universalmind303 and others added 4 commits August 13, 2025 11:59
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@universalmind303 universalmind303 changed the title feat: WIP: new daft.File datatype feat: new daft.File datatype Aug 13, 2025
universalmind303 and others added 3 commits August 13, 2025 12:14
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@universalmind303 universalmind303 enabled auto-merge (squash) August 14, 2025 23:47
@universalmind303 universalmind303 merged commit 14ce5e8 into main Aug 15, 2025
54 checks passed
@universalmind303 universalmind303 deleted the cory/daft-file branch August 15, 2025 17:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants