fix: lance schema does not work #4940
base: main
Conversation
Change-Id: I0e50a78624abe52f9878600f77c4da99bbc45620
Greptile Summary
This PR fixes a critical issue with Lance blob encoding in Daft's Lance integration. The problem was that when writing DataFrames to Lance datasets with schema metadata specifying `lance-encoding:blob`, the metadata was being lost during the write process, causing Lance's blob API to fail.
The core issue was in `daft/io/lance/lance_data_sink.py`, where the `write()` method was using `MicroPartition.to_arrow()` to convert data to Arrow format. This method strips field metadata, which Lance requires for specialized encodings like blob storage. The fix reconstructs the Arrow table using `pa.Table.from_batches()` with the original PyArrow schema, preserving the critical metadata that Lance needs.
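In PyArrow terms, the fix amounts to roughly the following (a minimal sketch with made-up column names and data, not the sink code itself):

```python
import pyarrow as pa

# The user-provided schema carries the Lance blob-encoding metadata on the field.
user_schema = pa.schema(
    [pa.field("blob", pa.large_binary(), metadata={"lance-encoding:blob": "true"})]
)

# Suppose `stripped` is an Arrow table whose data matches the schema but whose
# field-level metadata was lost along the way (as happened via MicroPartition.to_arrow()).
stripped = pa.table(
    {"blob": [b"payload"]},
    schema=pa.schema([pa.field("blob", pa.large_binary())]),
)
assert stripped.schema.field("blob").metadata is None

# Rebuilding the table from its record batches with the original schema restores the metadata.
fixed = pa.Table.from_batches(stripped.to_batches(), user_schema)
assert fixed.schema.field("blob").metadata == {b"lance-encoding:blob": b"true"}
```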
The PR also adds a test in `tests/io/lancedb/test_lancedb_writes.py` that verifies the fix works correctly. The test creates a dataset with blob-encoded binary data, confirms the schema metadata is preserved after writing, and successfully reads the data using Lance's blob API.
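Roughly, such a test might take the following shape (a sketch under assumptions about names, sizes, and assertions; not the test as merged):

```python
import daft
import lance
import pyarrow as pa


def test_write_lance_preserves_blob_metadata(tmp_path):
    path = str(tmp_path / "blob_dataset")
    schema = pa.schema(
        [
            pa.field("id", pa.int64()),
            pa.field("blob", pa.large_binary(), metadata={"lance-encoding:blob": "true"}),
        ]
    )

    df = daft.from_pydict({"id": [1, 2], "blob": [b"a" * 1024, b"b" * 1024]})
    df.write_lance(path, schema=daft.schema.Schema.from_pyarrow_schema(schema))

    # The field-level metadata must survive the write for Lance's blob API to work.
    ds = lance.dataset(path)
    assert ds.schema.field("blob").metadata == {b"lance-encoding:blob": b"true"}
```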
This change integrates with Daft's existing I/O infrastructure by ensuring that when users provide a schema with Lance-specific metadata through `df.write_lance(schema=...)`, that metadata is properly preserved through the entire write pipeline down to the underlying Lance storage layer.
Confidence score: 4/5
- This PR addresses a specific bug with a targeted fix that preserves existing functionality while enabling Lance blob encoding
- Score reflects a well-tested fix with clear understanding of the root cause and appropriate solution
- The lance_data_sink.py file requires careful attention to ensure the schema reconstruction doesn't introduce other Arrow compatibility issues
2 files reviewed, no comments
This failed unit test
Hi @ddupg! Thanks for the PR. I took a look at the failing test. Can you try to follow up with my suggestion?
df.write_lance(lance_dataset_path, schema=daft.schema.Schema.from_pyarrow_schema(schema))
import lance
Please move this to the top of the module with the other imports.
@@ -91,7 +92,7 @@ def write(self, micropartitions: Iterator[MicroPartition]) -> Iterator[WriteResu
         lance = self._import_lance()

         for micropartition in micropartitions:
-            arrow_table = micropartition.to_arrow()
+            arrow_table = pa.Table.from_batches(micropartition.to_arrow().to_batches(), self._pyarrow_schema)
I believe the problem with this PR is that this change may be disregarding the Daft-specific schema information. In the roundtrip embeddings tests, we're specifically checking to make sure that we're serializing and deserializing fixed-size arrays of a specific numeric type. We expect these values to always be accepted as / returned as numpy arrays (not Python lists).
The failing test's error on this PR is saying that the schema that's saved doesn't align with the one we expected to be saved. @ddupg, can you please try to inspect why this is changing the schema information when saved in Daft? What's the difference between what was being saved beforehand and what's being saved now?
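For context, "fixed-size arrays of a specific numeric type" maps to Arrow's fixed-size list type; a small illustration in plain PyArrow (not the failing test itself):

```python
import numpy as np
import pyarrow as pa

# An embedding column is a fixed-size list of a specific numeric type in Arrow terms.
emb_type = pa.list_(pa.float32(), 4)
arr = pa.array([np.ones(4, dtype=np.float32), np.zeros(4, dtype=np.float32)], type=emb_type)

assert pa.types.is_fixed_size_list(arr.type)
assert arr.type.value_type == pa.float32()
```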
`Micropartition.to_arrow` calls `RecordBatch.to_arrow_record_batch`, whose definition is:
pa.RecordBatch.from_pydict({column.name(): column.to_arrow() for column in self.columns()})
I believe that `column.to_arrow()` has the schema information that we need to preserve.
I think your PR should be trying to do something to add to this per-column schema information. Perhaps it should be something like "when writing in Lance, for each column that's a binary type, add the Lance-specific metadata without changing the existing Arrow schema information for the column."
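One possible shape for that suggestion, as a hypothetical helper (an assumption-laden sketch, not code from this PR):

```python
import pyarrow as pa


def add_lance_blob_metadata(schema: pa.Schema) -> pa.Schema:
    """Return a copy of `schema` where binary-typed fields gain the Lance
    blob-encoding metadata, leaving all other schema information untouched."""
    fields = []
    for field in schema:
        if pa.types.is_binary(field.type) or pa.types.is_large_binary(field.type):
            metadata = dict(field.metadata or {})
            metadata[b"lance-encoding:blob"] = b"true"
            field = field.with_metadata(metadata)
        fields.append(field)
    return pa.schema(fields, metadata=schema.metadata)
```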
Thanks @malcolmgreaves so much for taking the time to look into this PR and share your suggestions! I truly appreciate the detailed guidance you’ve provided.
Let me know if you have any other questions or want me to look at something else! Thank you for your contribution! 🙏
@malcolmgreaves @ddupg I provided a PR here: #4968. The main reason for the failure of the UT here is that the Arrow type was renamed by the register method, so I restored it in that PR.
Hi @Jay-ju! Thanks for making that other PR! Is there a branch where you combine those changes with the changes of this PR to show that it passes the test?
Hi @Jay-ju @ddupg -- just an update here. We worked on this bug internally yesterday. We have another approach that we think solves the problem. We're still coming up with the final touches on exactly how we want to fix this problem. We'll be able to get either that or these original PRs merged very soon.
Thanks @malcolmgreaves for the update! I really appreciate your team's efforts in addressing this issue. Looking forward to what's next!
Changes Made
Apply the specified schema to Lance's Fragments.
Related Issues
Closes #4939