Description
Currently the way to create or update a delta table with partitions is to have the partitions as columns in the data and then call write_deltalake with partition_by set to the name(s) of the partition column(s). This makes sense as the partition value can change per row; however, if the partition values are the same for all rows (and not in-band), it would be useful to be able to specify them explicitly out-of-band, e.g.:

```python
write_deltalake(path, data, partition_on=dict(partition_name=DataType.string()))
```
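For reference, a minimal sketch of the current in-band approach, where the static value has to be injected as a real column before writing (the path and column names here are placeholders):

```python
import pyarrow as pa
from deltalake import write_deltalake

# Example data; in practice this comes from the batch pipeline.
table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Inject the static partition value as an extra column -- the step
# this issue proposes to make unnecessary.
table = table.append_column(
    "partition_column", pa.array(["static value"] * table.num_rows)
)

# write_deltalake partitions by the injected column.
write_deltalake("path/to/table", table, partition_by=["partition_column"])
```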
Use Case
We currently use pyspark to write delta tables as part of our batch processing, where the partition columns are static per batch and not part of the data itself. This involves doing something along the lines of:
```python
from pyspark.sql.functions import lit

df = df.withColumn("partition_column", lit("static value"))
writer = df.write.format("delta").partitionBy("partition_column")
...
writer.save(path)
```
However, doing something similar with delta-rs isn't as simple, since write_deltalake expects an ArrowStreamExportable, so we can no longer inject the partition columns in pure python code.
It also seems slightly inefficient to add extra columns to the data only for them ultimately not to be written out to the parquet files in the delta table.
Activity
ion-elgreco commented on Jun 21, 2025
What do you mean? If you use an engine with a lazy execution model, you can add the partitions. Use datafusion-python for example to add a literal column and then pass the df into deltalake
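A minimal sketch of that suggestion (assuming a recent datafusion-python version whose DataFrame exports the Arrow C stream interface, so it can be passed straight to write_deltalake; path and column names are placeholders):

```python
from datafusion import SessionContext, lit
from deltalake import write_deltalake

ctx = SessionContext()
# Example data; in practice this comes from the batch pipeline.
df = ctx.from_pydict({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Add the static partition value as a literal column; with a lazy
# engine this is part of the plan, not materialized eagerly.
df = df.with_column("partition_column", lit("static value"))

# Recent datafusion-python DataFrames are ArrowStreamExportable,
# so they can be handed to write_deltalake directly.
write_deltalake("path/to/table", df, partition_by=["partition_column"])
```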
johnnyg commented on Jun 22, 2025
Sorry, I did mean to say "in pure python code without an external library" (although technically the engines aren't using pure python to do this stuff).
I have gotten it to work by using a lazy execution engine (so far only ibis/duckdb, as both datafusion and pyarrow's datasets throw errors), but not without a significant performance impact, which is why it would be great if we didn't need to do this column "injection" in the first place.
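For reference, the ibis/duckdb variant might look roughly like this (a sketch, assuming recent ibis and deltalake versions, with ibis's default duckdb backend; table, path, and column names are placeholders):

```python
import ibis
from deltalake import write_deltalake

# Example data; in practice this comes from the batch pipeline.
t = ibis.memtable({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Add the static partition value lazily via mutate.
t = t.mutate(partition_column=ibis.literal("static value"))

# Stream the result out as Arrow record batches and write the table.
write_deltalake(
    "path/to/table",
    t.to_pyarrow_batches(),
    partition_by=["partition_column"],
)
```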