A Deep Deep Dive into df.write_lance() #4600
everettVT
started this conversation in
Show and tell
Replies: 2 comments
-
I think this would be great source material for a blog post written from a deep research prompt.
-
Good job, @everettVT. I noticed you've done some work on the Daft Lance connector, and I'd like to hear your thoughts. Could we work together to break this work down?
-
As part of another discussion on adding LanceTable support to the Daft catalog, I began reviewing the daft.read_lance and df.write_lance methods and went down a huge rabbit hole, which helped me understand both how Daft performs distributed writes and how Lance fragments work.
I figured I would share the notes I made along the way as an exercise.
How Does Daft Write Lance Datasets?
dataframe.py
At daft/dataframe/dataframe.py with def write_lance()
lance_data_sink.py
using daft/dataframe/lance_data_sink.py class LanceDataSink
write_sink() in dataframe.py
whose methods are executed with DataFrame.write_sink()
write_lance in daft/logical/builder.py
which routes the sink operation through LogicalPlanBuilder.write_lance() from daft/logical/builder.py
lance_write in daft/execution/physical_plan.py
which runs def lance_write() from daft/execution/physical_plan.py
WriteLance in daft/execution/execution_step.py
which executes WriteLance in daft/execution/execution_step.py
write_lance in daft/execution/record_batch_io.py
which FINALLY executes write_lance() in daft/execution/recordbatch_io.py
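To make the layering above easier to picture, here is a minimal toy sketch in plain Python. It is not Daft's code: every class here (ToyLanceSink, ToyPlanBuilder, ToyDataFrame) is hypothetical and only borrows names from the notes, and the idea that each partition produces one write result is an assumption for illustration. Nothing is written to disk.

```python
from dataclasses import dataclass


@dataclass
class ToyLanceSink:
    """Hypothetical stand-in for a data sink: it only remembers where and how to write."""
    uri: str
    mode: str = "create"

    def write(self, rows, index):
        # Lowest layer: pretend to write one partition and report what was written.
        return {"uri": self.uri, "mode": self.mode, "partition": index, "num_rows": len(rows)}


class ToyPlanBuilder:
    """Hypothetical stand-in for the logical plan builder layer."""

    def __init__(self, partitions):
        self.partitions = partitions
        self.sink = None

    def write_lance(self, sink):
        # Logical layer: record the write as a plan step; nothing runs yet.
        self.sink = sink
        return self

    def execute(self):
        # Execution layer: run one write step per partition (an assumption of this sketch).
        return [self.sink.write(rows, i) for i, rows in enumerate(self.partitions)]


class ToyDataFrame:
    """Hypothetical stand-in for the user-facing DataFrame."""

    def __init__(self, partitions):
        self.partitions = partitions

    def write_sink(self, sink):
        # Route the sink through the plan builder, then execute the plan.
        return ToyPlanBuilder(self.partitions).write_lance(sink).execute()

    def write_lance(self, uri, mode="create"):
        # Entry point: build the sink, then hand off to write_sink().
        return self.write_sink(ToyLanceSink(uri, mode))


results = ToyDataFrame([[1, 2], [3]]).write_lance("/tmp/example.lance")
print(results)
```

The point of the sketch is only the shape of the hand-off: the user-facing method builds a sink, the sink is attached to a plan, and the actual per-partition write happens at the bottom of the stack.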