Allow compacting and maintaining a monotonic ordering #3538

Open
@tolleybot

Problem Statement

After running OPTIMIZE (compact) on a Delta table, the file/row layout is not guaranteed to be globally sorted by (objectId, dateTime),
so clients must perform an extra sort on read, which adds latency, complexity, and resource pressure.

Background & Motivation

  • Delta Lake’s current compaction merges files but does not enforce a global sort order.
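  • Existing Z-order / clustering features are orthogonal: they co-locate related values to improve data skipping but do not produce a total order on the key, whereas we need deterministic, monotonic ordering.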
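
To illustrate why Z-ordering does not give the ordering requested here, below is a toy sketch in plain Python. It assumes small non-negative integer keys and a simple bit-interleaving function (`z_value` is a hypothetical helper written for this example, not Delta Lake's implementation). Sorting by the interleaved value makes the first column go up and down, while a lexicographic sort keeps it monotonic.

```python
def z_value(x, y, bits=4):
    """Toy Morton code: interleave the low `bits` bits of x and y.

    Illustration only; real Z-order implementations work on encoded
    column values, not just small integers.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)  # bits of x go to odd positions
        z |= ((y >> i) & 1) << (2 * i)      # bits of y go to even positions
    return z


# Eight (x, y) points standing in for (objectId, dateTime) pairs.
points = [(x, y) for x in range(2) for y in range(4)]

z_sorted = sorted(points, key=lambda p: z_value(*p))
lex_sorted = sorted(points)  # tuple comparison is lexicographic

# In Z-order the first column is not monotonic: 0, 0, 1, 1, 0, 0, 1, 1
print([x for x, _ in z_sorted])
# In lexicographic order it is: 0, 0, 0, 0, 1, 1, 1, 1
print([x for x, _ in lex_sorted])
```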

Proposed Feature

  1. Add a global sort phase to the compaction pipeline:

    • After bin-packing files, feed all pages/records through a DataFusion df.sort(["objectId","dateTime"]) before writing.
    • Preserve the existing unsorted path when sorting is disabled for backward compatibility.
  2. Expose new options on the OptimizeBuilder (Rust) and DeltaTable.optimize (Python):

    • sort_enabled: bool (default = true)
    • sort_columns: Vec<String> / List[str] (default = ["objectId","dateTime"])
    • Builder methods:
      • Rust: .with_sort_columns(&[...]), .disable_sort()
      • Python: dt.optimize.compact(sort_enabled=False, sort_columns=["foo"])
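
A minimal sketch of the intended behavior, in plain Python rather than against the real delta-rs or DataFusion APIs. `CompactOptions` and `compact` are hypothetical names used only for illustration; files are modeled as lists of row dicts, and bin-packing is reduced to fixed-size chunking. The point is only to show where the sort phase would sit and that the unsorted path is left untouched when sorting is disabled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CompactOptions:
    # Hypothetical mirror of the proposed options; not an existing delta-rs type.
    sort_enabled: bool = True
    sort_columns: List[str] = field(
        default_factory=lambda: ["objectId", "dateTime"]
    )


def compact(
    files: List[List[Dict]],
    target_rows: int,
    options: Optional[CompactOptions] = None,
) -> List[List[Dict]]:
    """Merge small files into files of at most `target_rows` rows each."""
    if target_rows < 1:
        raise ValueError("target_rows must be at least 1")
    options = options or CompactOptions()

    # Gather the rows of all input files in their original order.
    rows = [row for f in files for row in f]

    # Proposed extra phase: a global sort before writing.
    if options.sort_enabled:
        rows.sort(key=lambda r: tuple(r[c] for c in options.sort_columns))

    # "Write" output files; with sorting on, file i ends before file i+1 begins.
    return [rows[i:i + target_rows] for i in range(0, len(rows), target_rows)]


small_files = [
    [{"objectId": 2, "dateTime": 1}, {"objectId": 1, "dateTime": 5}],
    [{"objectId": 1, "dateTime": 2}],
]
sorted_out = compact(small_files, target_rows=2)
unsorted_out = compact(small_files, 2, CompactOptions(sort_enabled=False))
```

With sorting enabled the output files can be read back to back without a further sort; with `sort_enabled=False` the rows keep their input order, matching today's behavior.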

Requirements & Constraints

  • Strict Ordering: Global sort by (objectId, dateTime) across all partitions and pages.
  • Performance: Sorting should not unduly impact compaction throughput; use DataFusion’s spillable memory pools so that sorts larger than available memory spill to disk rather than fail.
  • Configurable: Users can disable sorting or choose a different sort key.
  • Backward Compatible: Unsorted compaction remains available; default behavior may change only when opted in.
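
The ordering and performance requirements above can be sketched together, again in plain Python and not as DataFusion's actual implementation. `external_sort` shows the general sort-runs-then-merge idea that spill-to-disk sorts rely on (here the runs stay in memory, where a real implementation would write them to disk), and `is_globally_sorted` is a hypothetical check that a test suite could use to assert strict ordering across files. All names are assumptions made for this example.

```python
import heapq
from itertools import islice


def sort_key(columns):
    """Build a key function that orders rows by the given columns."""
    return lambda row: tuple(row[c] for c in columns)


def external_sort(rows, columns, max_rows_in_memory):
    """Sort rows while holding at most `max_rows_in_memory` unsorted rows."""
    if max_rows_in_memory < 1:
        raise ValueError("max_rows_in_memory must be at least 1")
    key = sort_key(columns)
    it = iter(rows)
    runs = []
    while True:
        chunk = list(islice(it, max_rows_in_memory))
        if not chunk:
            break
        chunk.sort(key=key)
        runs.append(chunk)  # a real implementation would spill this run to disk
    # k-way merge of the sorted runs yields one globally sorted stream.
    return heapq.merge(*runs, key=key)


def is_globally_sorted(files, columns):
    """True if keys never decrease within a file or from one file to the next."""
    key = sort_key(columns)
    keys = [key(row) for f in files for row in f]
    return all(a <= b for a, b in zip(keys, keys[1:]))


cols = ["objectId", "dateTime"]
rows = [
    {"objectId": o, "dateTime": t}
    for o, t in [(3, 1), (1, 9), (2, 4), (1, 2), (3, 0)]
]
result = list(external_sort(rows, cols, max_rows_in_memory=2))
```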

Labels

enhancement
