Skip to content

do not merge -- Mg/try lance schema changes #4975

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

malcolmgreaves
Copy link
Contributor

DO NOT MERGE

sunxin.ddupg and others added 2 commits August 13, 2025 17:52
@malcolmgreaves malcolmgreaves marked this pull request as draft August 14, 2025 00:53
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR refactors the Daft codebase to improve Lance integration by making schema handling more robust. The main changes include:

  1. DaftExtension Class Refactoring: The DaftExtension class in daft/datatype.py is moved from being defined inside the _ensure_registered_super_ext_type() function to module level. This improves code organization and makes the PyArrow extension type accessible throughout the module without requiring the registration function to be called first.

  2. Enhanced Schema Preservation in Lance Data Sink: The Lance data sink (daft/io/lance/lance_data_sink.py) is modified to explicitly reconstruct Arrow tables using the stored PyArrow schema. Instead of directly using micropartition.to_arrow(), it now uses pa.Table.from_batches() with the stored schema to ensure metadata preservation, particularly important for Lance-specific encodings like compression and blob storage.

  3. Blob Storage Test Coverage: A new test test_lancedb_write_blob validates the blob encoding functionality, ensuring that binary data with Lance-specific metadata can be written and retrieved correctly using Lance's specialized blob API.

These changes collectively improve the integration between Daft and Lance by ensuring schema consistency and supporting advanced Lance features like blob encoding for efficient binary data storage.

Confidence score: 4/5

  • This PR is generally safe to merge with some attention to the schema handling changes
  • Score reflects solid refactoring and test coverage, but lowered due to schema reconstruction logic that could impact data fidelity
  • Pay close attention to daft/io/lance/lance_data_sink.py for potential schema compatibility issues

3 files reviewed, 1 comment

Edit Code Review Bot Settings | Greptile

blobs_data = [b"foo", b"bar", b"baz"]
df = daft.from_pydict({"blob": blobs_data})

df.write_lance(lance_dataset_path, schema=daft.schema.Schema.from_pyarrow_schema(schema))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Missing mode parameter - should specify mode='create' for consistency with other tests

Copy link

codspeed-hq bot commented Aug 14, 2025

CodSpeed Performance Report

Merging #4975 will degrade performances by 53.86%

Comparing mg/try_lance_schema_changes (fe9db2e) with main (0129008)

Summary

❌ 1 regressions
✅ 23 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark BASE HEAD Change
test_iter_rows_first_row[1 Small File] 124.7 ms 270.2 ms -53.86%

Copy link

codecov bot commented Aug 14, 2025

Codecov Report

❌ Patch coverage is 86.66667% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.15%. Comparing base (0129008) to head (fe9db2e).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
daft/datatype.py 92.30% 1 Missing ⚠️
daft/io/lance/lance_data_sink.py 50.00% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4975      +/-   ##
==========================================
- Coverage   79.59%   79.15%   -0.44%     
==========================================
  Files         917      917              
  Lines      127014   127016       +2     
==========================================
- Hits       101094   100544     -550     
- Misses      25920    26472     +552     
Files with missing lines Coverage Δ
daft/datatype.py 94.39% <92.30%> (-0.19%) ⬇️
daft/io/lance/lance_data_sink.py 82.81% <50.00%> (ø)

... and 11 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@malcolmgreaves malcolmgreaves changed the title Mg/try lance schema changes do not merge -- Mg/try lance schema changes Aug 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants