
Feature: partition on non-column (static) data #3536

Open
@johnnyg

Description


Currently, the way to create or update a Delta table with partitions is to include the partition values as columns in the data and then call write_deltalake with partition_by set to the name(s) of the partition column(s). This makes sense when the partition value can change per row; however, if the partition values are the same for all rows (and not in-band), it would be useful to be able to specify them explicitly out-of-band, e.g.:

write_deltalake(path, data, partition_on=dict(partition_name=DataType.string()))

Use Case

We currently use pyspark to write Delta tables as part of our batch processing, where the partition columns are static per batch and not part of the data itself. This involves us doing something along the lines of:

from pyspark.sql.functions import lit

df = df.withColumn(colName="partition_column", col=lit("static value"))
writer = df.write.format("delta").partitionBy("partition_column")
...
writer.save(path)

However, doing something similar with delta-rs isn't as simple: write_deltalake expects an ArrowStreamExportable, so we can no longer inject the partition columns in pure Python code.
It also seems slightly inefficient to add extra columns to the data only for them ultimately not to be written out to the Parquet files in the Delta table.

Activity

ion-elgreco (Collaborator) commented on Jun 21, 2025

> However doing something similar with delta-rs isn't as simple as write_deltalake expects an ArrowStreamExportable so we can't inject the partition columns in pure python code anymore.

What do you mean? If you use an engine with a lazy execution model, you can add the partitions. Use datafusion-python for example to add a literal column and then pass the df into deltalake

johnnyg (Author) commented on Jun 22, 2025

> However doing something similar with delta-rs isn't as simple as write_deltalake expects an ArrowStreamExportable so we can't inject the partition columns in pure python code anymore.

> What do you mean? If you use an engine with a lazy execution model, you can add the partitions. Use datafusion-python for example to add a literal column and then pass the df into deltalake

Sorry, I did mean to say "in pure Python code without an external library" (although technically the engines aren't using pure Python to do this either).

I have gotten it to work by using a lazy execution engine (so far only ibis/duckdb, as both datafusion and pyarrow's datasets throw errors), but not without a significant performance impact, which is why it would be great if we didn't need to do this column "injection" in the first place.


Metadata

Assignees: No one assigned
Labels: enhancement (New feature or request)
Participants: @johnnyg, @ion-elgreco

          Feature: partition on non-column (static) data · Issue #3536 · delta-io/delta-rs