fix: lance schema does not work #4940
base: main
Conversation
Change-Id: I0e50a78624abe52f9878600f77c4da99bbc45620
Greptile Summary
This PR fixes a critical issue with Lance blob encoding in Daft's Lance integration. The problem was that when writing DataFrames to Lance datasets with schema metadata specifying `lance-encoding:blob`, the metadata was being lost during the write process, causing Lance's blob API to fail.
The core issue was in `daft/io/lance/lance_data_sink.py`, where the `write()` method was using `MicroPartition.to_arrow()` to convert data to Arrow format. This method strips field metadata, which Lance requires for specialized encodings like blob storage. The fix reconstructs the Arrow table using `pa.Table.from_batches()` with the original PyArrow schema, preserving the critical metadata that Lance needs.
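In PyArrow terms, the fix amounts to roughly the following (a minimal sketch with made-up column names and data, not the sink code itself):

```python
import pyarrow as pa

# The user-provided schema carries the Lance blob-encoding metadata on the field.
user_schema = pa.schema(
    [pa.field("blob", pa.large_binary(), metadata={"lance-encoding:blob": "true"})]
)

# Suppose `stripped` is an Arrow table whose data matches the schema but whose
# field-level metadata was lost along the way (as happened via MicroPartition.to_arrow()).
stripped = pa.table(
    {"blob": [b"payload"]},
    schema=pa.schema([pa.field("blob", pa.large_binary())]),
)
assert stripped.schema.field("blob").metadata is None

# Rebuilding the table from its record batches with the original schema restores the metadata.
fixed = pa.Table.from_batches(stripped.to_batches(), user_schema)
assert fixed.schema.field("blob").metadata == {b"lance-encoding:blob": b"true"}
```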
The PR also adds a test in `tests/io/lancedb/test_lancedb_writes.py` that verifies the fix works correctly. The test creates a dataset with blob-encoded binary data, confirms the schema metadata is preserved after writing, and successfully reads the data using Lance's blob API.
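Roughly, such a test might take the following shape (a sketch under assumptions about names, sizes, and assertions; not the test as merged):

```python
import daft
import lance
import pyarrow as pa


def test_write_lance_preserves_blob_metadata(tmp_path):
    path = str(tmp_path / "blob_dataset")
    schema = pa.schema(
        [
            pa.field("id", pa.int64()),
            pa.field("blob", pa.large_binary(), metadata={"lance-encoding:blob": "true"}),
        ]
    )

    df = daft.from_pydict({"id": [1, 2], "blob": [b"a" * 1024, b"b" * 1024]})
    df.write_lance(path, schema=daft.schema.Schema.from_pyarrow_schema(schema))

    # The field-level metadata must survive the write for Lance's blob API to work.
    ds = lance.dataset(path)
    assert ds.schema.field("blob").metadata == {b"lance-encoding:blob": b"true"}
```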
This change integrates with Daft's existing I/O infrastructure by ensuring that when users provide a schema with Lance-specific metadata through `df.write_lance(schema=...)`, that metadata is properly preserved through the entire write pipeline down to the underlying Lance storage layer.
Confidence score: 4/5
- This PR addresses a specific bug with a targeted fix that preserves existing functionality while enabling Lance blob encoding
- Score reflects a well-tested fix with clear understanding of the root cause and appropriate solution
- The lance_data_sink.py file requires careful attention to ensure the schema reconstruction doesn't introduce other Arrow compatibility issues
2 files reviewed, no comments
This failed unit test
Hi @ddupg! Thanks for the PR. I took a look at the failing test. Can you try to follow up with my suggestion?
df.write_lance(lance_dataset_path, schema=daft.schema.Schema.from_pyarrow_schema(schema))
import lance
Please move this to the top of the module with the other imports.
@@ -91,7 +92,7 @@ def write(self, micropartitions: Iterator[MicroPartition]) -> Iterator[WriteResu
         lance = self._import_lance()

         for micropartition in micropartitions:
-            arrow_table = micropartition.to_arrow()
+            arrow_table = pa.Table.from_batches(micropartition.to_arrow().to_batches(), self._pyarrow_schema)
I believe the problem with this PR is that this change may be disregarding the Daft-specific schema information. In the roundtrip embeddings tests, we're specifically checking to make sure that we're serializing and deserializing fixed-size arrays of a specific numeric type. We expect these values to always be accepted as / returned as numpy arrays (not Python lists).
The failing test's error on this PR is saying that the schema that's saved doesn't align with the one we expected to be saved. @ddupg, can you please try to inspect why this is changing the schema information when saved in Daft? What's the difference between what was being saved beforehand and what's being saved now?
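For context, "fixed-size arrays of a specific numeric type" maps to Arrow's fixed-size list type; a small illustration in plain PyArrow (not the failing test itself):

```python
import numpy as np
import pyarrow as pa

# An embedding column is a fixed-size list of a specific numeric type in Arrow terms.
emb_type = pa.list_(pa.float32(), 4)
arr = pa.array([np.ones(4, dtype=np.float32), np.zeros(4, dtype=np.float32)], type=emb_type)

assert pa.types.is_fixed_size_list(arr.type)
assert arr.type.value_type == pa.float32()
```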
`Micropartition.to_arrow` calls `RecordBatch.to_arrow_record_batch`, whose definition is:
pa.RecordBatch.from_pydict({column.name(): column.to_arrow() for column in self.columns()})
I believe that `column.to_arrow()` has the schema information that we need to preserve.
I think your PR should be trying to do something to add to this per-column schema information. Perhaps it should be something like "when writing in Lance, for each column that's a binary type, add the Lance-specific metadata without changing the existing Arrow schema information for the column."
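One possible shape for that suggestion, as a hypothetical helper (an assumption-laden sketch, not code from this PR):

```python
import pyarrow as pa


def add_lance_blob_metadata(schema: pa.Schema) -> pa.Schema:
    """Return a copy of `schema` where binary-typed fields gain the Lance
    blob-encoding metadata, leaving all other schema information untouched."""
    fields = []
    for field in schema:
        if pa.types.is_binary(field.type) or pa.types.is_large_binary(field.type):
            metadata = dict(field.metadata or {})
            metadata[b"lance-encoding:blob"] = b"true"
            field = field.with_metadata(metadata)
        fields.append(field)
    return pa.schema(fields, metadata=schema.metadata)
```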
Thanks @malcolmgreaves so much for taking the time to look into this PR and share your suggestions! I truly appreciate the detailed guidance you’ve provided.
Let me know if you have any other questions or want me to look at something else! Thank you for your contribution! 🙏
@malcolmgreaves @ddupg I provided a PR here: #4968. The main reason for the failure of the UT here is that the Arrow type was renamed by the register method, so I restored it in that PR.
Hi @Jay-ju! Thanks for making that other PR! Is there a branch where you combine those changes with the changes of this PR to show that it passes the test?
Hi @Jay-ju @ddupg -- just an update here. We worked on this bug internally yesterday. We have another approach that we think solves the problem. We're still coming up with the final touches on exactly how we want to fix this problem. We'll be able to get either that or these original PRs merged very soon.
Thanks @malcolmgreaves for the update! I really appreciate your team's efforts in addressing this issue. Looking forward to what's next!
Changes Made
Apply the specified schema to Lance's Fragments.
Related Issues
Closes #4939