fix: lance schema does not work #4940


Open · wants to merge 1 commit into main
Conversation

@ddupg ddupg commented Aug 8, 2025

Changes Made

Apply the specified schema to Lance's Fragments.

Related Issues

Closes #4939

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

Change-Id: I0e50a78624abe52f9878600f77c4da99bbc45620
@github-actions github-actions bot added the fix label Aug 8, 2025

@greptile-apps greptile-apps bot left a comment

Greptile Summary

This PR fixes a critical issue with Lance blob encoding in Daft's Lance integration. The problem was that when writing DataFrames to Lance datasets with schema metadata specifying lance-encoding:blob, the metadata was being lost during the write process, causing Lance's blob API to fail.

The core issue was in daft/io/lance/lance_data_sink.py where the write() method was using MicroPartition.to_arrow() to convert data to Arrow format. This method strips field metadata, which Lance requires for specialized encodings like blob storage. The fix reconstructs the Arrow table using pa.Table.from_batches() with the original PyArrow schema, preserving the critical metadata that Lance needs.

The PR also adds a comprehensive test in tests/io/lancedb/test_lancedb_writes.py that verifies the fix works correctly. The test creates a dataset with blob-encoded binary data, confirms the schema metadata is preserved after writing, and successfully reads the data using Lance's blob API.

This change integrates with Daft's existing I/O infrastructure by ensuring that when users provide a schema with Lance-specific metadata through df.write_lance(schema=...), that metadata is properly preserved through the entire write pipeline to the underlying Lance storage layer.

Confidence score: 4/5

  • This PR addresses a specific bug with a targeted fix that preserves existing functionality while enabling Lance blob encoding
  • Score reflects a well-tested fix with clear understanding of the root cause and appropriate solution
  • The lance_data_sink.py file requires careful attention to ensure the schema reconstruction doesn't introduce other Arrow compatibility issues

2 files reviewed, no comments


@universalmind303 universalmind303 self-requested a review August 8, 2025 15:48

ddupg commented Aug 11, 2025

The failing unit test tests/io/test_roundtrip_embeddings.py::test_roundtrip_embedding appears to be related to Arrow ExtensionType conversion in a Ray distributed environment. As I’m new to this area, I haven’t found a solution yet. Could @universalmind303 @malcolmgreaves @Jay-ju help me with this?


@malcolmgreaves malcolmgreaves left a comment

Hi @ddupg! Thanks for the PR. I took a look at the failing test. Can you try to follow up with my suggestion?


df.write_lance(lance_dataset_path, schema=daft.schema.Schema.from_pyarrow_schema(schema))

import lance

Please move this to the top of the module with the other imports.

@@ -91,7 +92,7 @@ def write(self, micropartitions: Iterator[MicroPartition]) -> Iterator[WriteResu
     lance = self._import_lance()

     for micropartition in micropartitions:
-        arrow_table = micropartition.to_arrow()
+        arrow_table = pa.Table.from_batches(micropartition.to_arrow().to_batches(), self._pyarrow_schema)

I believe the problem with this PR is that this change may be disregarding the Daft-specific schema information. In the roundtrip embeddings tests, we're specifically checking that we serialize and deserialize fixed-size arrays of a specific numeric type. We expect these values to always be accepted and returned as numpy arrays, not Python lists.

The failing test's error on this PR says that the saved schema doesn't align with the one we expected to be saved. @ddupg, can you please inspect why this change alters the schema information when saved in Daft? What's the difference between what was being saved before and what's being saved now?


MicroPartition.to_arrow calls RecordBatch.to_arrow_record_batch, whose definition is:

pa.RecordBatch.from_pydict({column.name(): column.to_arrow() for column in self.columns()})

I believe that column.to_arrow() has the schema information that we need to preserve.


I think your PR should be trying to do something to add to this per-column schema information. Perhaps it should be something like "when writing in lance, for each column that's a binary type, add the lance-specific metadata without changing the existing arrow schema information for the column."


Thanks so much, @malcolmgreaves, for taking the time to look into this PR and share your suggestions! I truly appreciate the detailed guidance you’ve provided.


Let me know if you have any other questions or want me to look at something else! Thank you for your contribution!🙏


Jay-ju commented Aug 13, 2025

@malcolmgreaves @ddupg I opened PR #4968. The main reason the UT fails here is that the Arrow type was renamed by the register method, so I restored it there.

@malcolmgreaves

Hi @Jay-ju ! Thanks for making that other PR! Is there a branch where you combine those changes with the changes of this PR to show that it passes the test?

@malcolmgreaves

I made #4975 to test out the changes of this PR and #4968. If this works then we can merge #4968 first and then merge this PR.


Jay-ju commented Aug 14, 2025

> I made #4975 to test out the changes of this PR and #4968. If this works then we can merge #4968 first and then merge this PR.

@malcolmgreaves, @ddupg is my colleague, so #4968 has already been run against the UT on this PR and it solves the problem.

@malcolmgreaves

Hi @Jay-ju @ddupg -- just an update here. We worked on this bug internally yesterday. We have another approach that we think solves the problem. We're still putting the final touches on exactly how we want to fix it. We'll be able to get either that or these original PRs merged very soon.


ddupg commented Aug 15, 2025

> just an update here. We worked on this bug internally yesterday. We have another approach that we think solves the problem. We're still coming up with the final touches on exactly how we want to fix this problem. We'll be able to get either that or these original PRs merged very soon.

Thanks, @malcolmgreaves, for the update! I really appreciate your team’s efforts in addressing this issue. Looking forward to what's next!

Successfully merging this pull request may close these issues.

Lance schema does not work