Description
Problem Statement
After running OPTIMIZE (compact) on a Delta table, the file/row layout is not guaranteed to be globally sorted by (objectId, dateTime), so clients must perform an extra sort on read, which adds latency, complexity, and resource pressure.
Background & Motivation
- Delta Lake’s current compaction merges files but does not enforce a global sort order.
- Existing Z-order / clustering features are orthogonal; we need deterministic, monotonic ordering.
Proposed Feature
- Add a global sort phase to the compaction pipeline:
  - After bin-packing files, feed all pages/records through a DataFusion `df.sort(["objectId","dateTime"])` before writing.
  - Preserve the existing unsorted path when sorting is disabled for backward compatibility.
- Expose new options on the `OptimizeBuilder` (Rust) and `DeltaTable.optimize` (Python):
  - `sort_enabled: bool` (default = true)
  - `sort_columns: Vec<String>` / `List[str]` (default = `["objectId","dateTime"]`)
  - Builder methods:
    - Rust: `.with_sort_columns(&[...])`, `.disable_sort()`
    - Python: `dt.optimize.compact(sort_enabled=False, sort_columns=["foo"])`
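To make the intended semantics of the two options concrete, here is a minimal pure-Python sketch. It is a toy model, not delta-rs code: the `compact` function, the `target_rows_per_file` parameter, and the list-of-dicts "files" are hypothetical stand-ins for bin-packing and the Parquet writer. Only `sort_enabled`, `sort_columns`, and their defaults come from this proposal.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, object]

# Default sort key proposed in this issue.
DEFAULT_SORT_COLUMNS = ["objectId", "dateTime"]


def compact(
    files: List[List[Row]],
    target_rows_per_file: int,
    sort_enabled: bool = True,
    sort_columns: Optional[Sequence[str]] = None,
) -> List[List[Row]]:
    """Toy model of compaction: merge small files, optionally sort globally."""
    columns = list(sort_columns) if sort_columns is not None else DEFAULT_SORT_COLUMNS
    # Stand-in for bin-packing: gather the rows of every input file.
    rows = [row for f in files for row in f]
    if sort_enabled:
        # Global sort phase: order all rows by the sort key before writing.
        rows.sort(key=lambda r: tuple(r[c] for c in columns))
    # Stand-in for the writer: cut the row stream into target-sized files.
    return [
        rows[i : i + target_rows_per_file]
        for i in range(0, len(rows), target_rows_per_file)
    ]


small_files = [
    [{"objectId": 2, "dateTime": "2024-01-02"}],
    [{"objectId": 1, "dateTime": "2024-01-03"}, {"objectId": 2, "dateTime": "2024-01-01"}],
    [{"objectId": 1, "dateTime": "2024-01-01"}],
]

# Sorted path (proposed default) and the preserved unsorted path.
sorted_out = compact(small_files, target_rows_per_file=2)
unsorted_out = compact(small_files, target_rows_per_file=2, sort_enabled=False)
```

With sorting enabled, reading the output files in order yields rows ordered by `(objectId, dateTime)`; with `sort_enabled=False`, rows keep their input order, which matches today's behavior.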
Requirements & Constraints
- Strict Ordering: Global sort by `(objectId, dateTime)` across all partitions and pages.
- Performance: Sorting should not unduly impact compaction throughput; use DataFusion’s spillable memory pools.
- Configurable: Users can disable sorting or choose a different sort key.
- Backward Compatible: Unsorted compaction remains available; default behavior may change only when opted in.
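The ordering and performance requirements can be illustrated together with a small external merge sort: sorted runs are spilled to temporary files once an in-memory budget is exceeded, then merged lazily. This is only a sketch of the general spill-and-merge technique, not DataFusion's implementation. The names `external_sort`, `max_rows_in_memory`, and `is_globally_sorted` are hypothetical, and rows are assumed to be JSON-serializable.

```python
import heapq
import json
import tempfile
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple

Row = Dict[str, object]


def sort_key(row: Row, columns: Sequence[str]) -> Tuple[object, ...]:
    return tuple(row[c] for c in columns)


def external_sort(
    rows: Iterable[Row], columns: Sequence[str], max_rows_in_memory: int
) -> Iterator[Row]:
    """Sort with bounded memory: spill sorted runs to disk, then k-way merge."""
    runs = []
    buffer: List[Row] = []

    def spill() -> None:
        # Sort the in-memory buffer and write it out as one sorted run.
        buffer.sort(key=lambda r: sort_key(r, columns))
        run = tempfile.TemporaryFile(mode="w+")
        for row in buffer:
            run.write(json.dumps(row) + "\n")
        run.seek(0)
        runs.append(run)
        buffer.clear()

    for row in rows:
        buffer.append(row)
        if len(buffer) >= max_rows_in_memory:
            spill()
    if buffer:
        spill()

    # Lazily merge the sorted runs; only one row per run is held in memory.
    streams = [(json.loads(line) for line in run) for run in runs]
    try:
        yield from heapq.merge(*streams, key=lambda r: sort_key(r, columns))
    finally:
        for run in runs:
            run.close()


def is_globally_sorted(rows: Sequence[Row], columns: Sequence[str]) -> bool:
    """Check the strict-ordering requirement over the full row stream."""
    keys = [sort_key(r, columns) for r in rows]
    return all(a <= b for a, b in zip(keys, keys[1:]))


rows = [{"objectId": i % 3, "dateTime": 100 - i} for i in range(10)]
result = list(external_sort(rows, ["objectId", "dateTime"], max_rows_in_memory=4))
```

A check like `is_globally_sorted` could also serve as an acceptance test for the feature: read the compacted files in order and assert that the key never decreases across file boundaries.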