do not merge -- Mg/try lance schema changes #4975
Change-Id: I0e50a78624abe52f9878600f77c4da99bbc45620
Greptile Summary

This PR refactors the Daft codebase to improve Lance integration by making schema handling more robust. The main changes include:

- **`DaftExtension` class refactoring**: The `DaftExtension` class in `daft/datatype.py` is moved from being defined inside the `_ensure_registered_super_ext_type()` function to module level. This improves code organization and makes the PyArrow extension type accessible throughout the module without requiring the registration function to be called first.
- **Enhanced schema preservation in the Lance data sink**: The Lance data sink (`daft/io/lance/lance_data_sink.py`) is modified to explicitly reconstruct Arrow tables using the stored PyArrow schema. Instead of using `micropartition.to_arrow()` directly, it now uses `pa.Table.from_batches()` with the stored schema to ensure metadata preservation, which is particularly important for Lance-specific encodings like compression and blob storage.
- **Blob storage test coverage**: A new test, `test_lancedb_write_blob`, validates the blob encoding functionality, ensuring that binary data with Lance-specific metadata can be written and retrieved correctly using Lance's specialized blob API.

These changes collectively improve the integration between Daft and Lance by ensuring schema consistency and supporting advanced Lance features like blob encoding for efficient binary data storage.
Confidence score: 4/5

- This PR is generally safe to merge with some attention to the schema handling changes.
- Score reflects solid refactoring and test coverage, but is lowered due to schema reconstruction logic that could impact data fidelity.
- Pay close attention to `daft/io/lance/lance_data_sink.py` for potential schema compatibility issues.
3 files reviewed, 1 comment
```python
blobs_data = [b"foo", b"bar", b"baz"]
df = daft.from_pydict({"blob": blobs_data})

df.write_lance(lance_dataset_path, schema=daft.schema.Schema.from_pyarrow_schema(schema))
```
style: Missing `mode` parameter - should specify `mode='create'` for consistency with other tests.
CodSpeed Performance Report

Merging #4975 will degrade performance by 53.86%.

Benchmarks breakdown
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #4975      +/-   ##
==========================================
- Coverage   79.59%   79.15%    -0.44%
==========================================
  Files         917      917
  Lines      127014   127016        +2
==========================================
- Hits       101094   100544      -550
- Misses      25920    26472      +552
```
DO NOT MERGE