
feat: add clickhouse data sink #4850

Open

wants to merge 11 commits into main

Conversation

huleilei
Contributor

Changes Made

Add a ClickHouse data sink and use it to write a DataFrame to a ClickHouse table.

Related Issues

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

github-actions bot added the feat label Jul 25, 2025

greptile-apps bot left a comment


Greptile Summary

This PR adds ClickHouse data sink functionality to Daft, enabling users to write DataFrames directly to ClickHouse tables. The implementation follows Daft's established DataSink pattern used by other database connectors like Lance and Turbopuffer.

The changes include:

  • New dependency management: Added clickhouse-connect >= 0.8.18 as an optional dependency in pyproject.toml and pinned version in requirements-dev.txt
  • ClickHouse data sink implementation: Created ClickHouseDataSink class in daft/io/clickhouse/clickhouse_data_sink.py that handles connection management, micropartition processing, and result aggregation
  • DataFrame integration: Added write_clickhouse() method to the DataFrame class with comprehensive parameter support for host, port, credentials, database, table, and custom client/write options
  • Module structure: Created the daft/io/clickhouse/ package directory with __init__.py

The implementation integrates with Daft's existing data sink architecture, using the write/finalize pattern for distributed execution. Users can now write DataFrames to ClickHouse via df.write_clickhouse(host="localhost", port=8123, table="my_table") and receive aggregated statistics about the write operation.
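A minimal usage sketch of the API as described above; exact parameter defaults and the shape of the returned statistics may differ from the final implementation:

```python
import daft

# Illustrative data.
df = daft.from_pydict({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# host/port/table match the example in the summary above; the remaining
# keyword arguments are assumptions based on the parameter list described
# in this PR.
stats = df.write_clickhouse(
    host="localhost",
    port=8123,
    user="default",
    password="",
    database="default",
    table="my_table",
)

# The sink is described as returning aggregated statistics about the write.
stats.show()
```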

Confidence score: 2/5

  • This PR has several critical issues that need to be addressed before it can be safely merged
  • The main concerns are incomplete module setup, naming inconsistencies, and potential configuration security issues
  • Files that need more attention: daft/io/clickhouse/__init__.py, daft/io/clickhouse/clickhouse_data_sink.py, and pyproject.toml

Critical Issues:

  1. Empty module interface: daft/io/clickhouse/__init__.py is empty, making the ClickHouse functionality inaccessible to users who try to import it (a minimal export sketch follows this list)
  2. Naming inconsistencies: The data sink implementation contains multiple references to "ByteHouse" instead of "ClickHouse" in comments and variable names, suggesting incomplete adaptation from another codebase
  3. Configuration security flaw: The client kwargs merging logic allows user-provided parameters to potentially override explicit connection parameters
  4. Incomplete dependency setup: The 'clickhouse' extra is not included in the 'all' extras list in pyproject.toml
  5. Unused code: The ClickHouseConfig class is defined but never used
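A minimal sketch of the missing export, assuming the module path and class name shown in this review; the exact public interface is for the PR author to decide:

```python
# daft/io/clickhouse/__init__.py
from daft.io.clickhouse.clickhouse_data_sink import ClickHouseDataSink

__all__ = ["ClickHouseDataSink"]
```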

5 files reviewed, 6 comments


pyproject.toml Outdated
@@ -25,6 +25,7 @@ requires-python = ">=3.9"
all = ["daft[aws, azure, gcp, ray, pandas, numpy, iceberg, deltalake, spark, sql, unity]"]
Contributor


logic: The 'all' extra should include 'clickhouse' to ensure users who install all optional dependencies get ClickHouse support

Suggested change
all = ["daft[aws, azure, gcp, ray, pandas, numpy, iceberg, deltalake, spark, sql, unity]"]
all = ["daft[aws, azure, gcp, ray, pandas, numpy, iceberg, deltalake, spark, sql, unity, clickhouse]"]

Comment on lines 18 to 25
class ClickHouseConfig:
    def __init__(self, host: str, port: int, user: str, password: str, database: str, table: str) -> None:
        self.host = host
        self.port = port
        self.user = user
        self.password = password
        self.database = database
        self.table = table
Contributor


style: This ClickHouseConfig class is defined but never used in the implementation. Consider removing it or integrating it into the main class design.

Comment on lines 49 to 50
client_kwargs = client_kwargs or {}
self._client_kwargs = {**client_kwargs, **self._client_kwargs}
Contributor


logic: The merging order allows client_kwargs to override explicit parameters. This could cause unexpected behavior if users pass conflicting values.

Suggested change
client_kwargs = client_kwargs or {}
self._client_kwargs = {**client_kwargs, **self._client_kwargs}
self._client_kwargs = {**self._client_kwargs, **client_kwargs}
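For context on the merge semantics this comment refers to: in `{**a, **b}` the dict unpacked last wins, so which variant is safer depends on which of the two dicts holds the explicit connection parameters. A standalone illustration (variable names here are illustrative, not from the PR):

```python
explicit = {"host": "localhost", "port": 8123}       # e.g. built from explicit arguments
user_kwargs = {"port": 9000, "connect_timeout": 10}  # e.g. caller-supplied client_kwargs

# In {**a, **b}, keys from b (the dict unpacked last) override duplicates in a.
explicit_wins = {**user_kwargs, **explicit}
user_wins = {**explicit, **user_kwargs}

print(explicit_wins["port"])  # 8123 -> explicit parameters win
print(user_wins["port"])      # 9000 -> caller kwargs win
```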

return self._result_schema

def write(self, micropartitions: Iterator[MicroPartition]) -> Iterator[WriteResult[QuerySummary]]:
    """Writes to Bytehouse from the given micropartitions."""
Contributor


syntax: Comment incorrectly references "Bytehouse" instead of "ClickHouse".

Suggested change
"""Writes to Bytehouse from the given micropartitions."""
"""Writes to ClickHouse from the given micropartitions."""

def write(self, micropartitions: Iterator[MicroPartition]) -> Iterator[WriteResult[QuerySummary]]:
    """Writes to Bytehouse from the given micropartitions."""
    # socket cannot be serialized, so we need to create a new client in write
    bh_client = get_client(**self._client_kwargs)
Contributor


style: Variable name bh_client suggests ByteHouse instead of ClickHouse. Consider renaming for consistency.

Suggested change
bh_client = get_client(**self._client_kwargs)
ch_client = get_client(**self._client_kwargs)

bh_client.close()

def finalize(self, write_results: list[WriteResult[QuerySummary]]) -> MicroPartition:
    """Finish write to ByteHouse dataset. Returns a DataFrame with the stats of the dataset."""
Contributor


syntax: Comment incorrectly references "ByteHouse" instead of "ClickHouse".

Suggested change
"""Finish write to ByteHouse dataset. Returns a DataFrame with the stats of the dataset."""
"""Finish write to ClickHouse dataset. Returns a DataFrame with the stats of the dataset."""



class ClickHouseDataSink(DataSink[QuerySummary]):
    def __init__(
Contributor


ClickHouseConfig has been defined. Why are the parameters still passed separately here?

Contributor Author


ClickHouseConfig is not needed.



#clickhouse
clickhouse-connect==0.8.18
Contributor

@Jay-ju Jul 25, 2025


Does this version have to be pinned? Why is it different from pyproject.toml?

tbl = MicroPartition.from_pydict(
    {
        "total_written_rows": pa.array([total_written_rows], pa.int64()),
        "total_written_bytes": pa.array([total_written_bytes], pa.int64()),
Contributor


Does total_written_bytes only exist in the ClickHouse data sink and not in other data sinks? Could it be abstracted into generic sink metrics?

Contributor Author


I think the total_written_bytes field is ClickHouse-specific. Different sinks report different metrics, right?
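For readers following this thread, a rough sketch of the kind of aggregation finalize performs over these ClickHouse-specific stats (plain pyarrow here; the written_rows/written_bytes attribute names on the summary objects are assumptions based on the column names in the excerpt above):

```python
import pyarrow as pa

def aggregate_write_stats(summaries) -> pa.Table:
    # Sum per-partition statistics into the two columns shown in the excerpt;
    # the attribute names are assumptions, hence the defensive getattr defaults.
    total_written_rows = sum(int(getattr(s, "written_rows", 0) or 0) for s in summaries)
    total_written_bytes = sum(int(getattr(s, "written_bytes", 0) or 0) for s in summaries)
    return pa.table(
        {
            "total_written_rows": pa.array([total_written_rows], pa.int64()),
            "total_written_bytes": pa.array([total_written_bytes], pa.int64()),
        }
    )
```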


codecov bot commented Jul 25, 2025

Codecov Report

❌ Patch coverage is 98.11321% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 79.28%. Comparing base (bda006a) to head (9d50769).

Files with missing lines | Patch % | Lines
daft/io/clickhouse/clickhouse_data_sink.py | 97.77% | 1 Missing ⚠️
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #4850      +/-   ##
==========================================
+ Coverage   79.27%   79.28%   +0.01%     
==========================================
  Files         908      910       +2     
  Lines      125867   125920      +53     
==========================================
+ Hits        99777    99833      +56     
+ Misses      26090    26087       -3     
Files with missing lines | Coverage Δ
daft/dataframe/dataframe.py | 86.89% <100.00%> (+0.05%) ⬆️
daft/io/clickhouse/__init__.py | 100.00% <100.00%> (ø)
daft/io/clickhouse/clickhouse_data_sink.py | 97.77% <97.77%> (ø)

... and 1 file with indirect coverage changes



greptile-apps bot left a comment


Greptile Summary

This PR adds ClickHouse data sink functionality to Daft, enabling users to write DataFrames directly to ClickHouse tables. The implementation follows Daft's established DataSink pattern used by other database connectors like Lance and Turbopuffer.

The recent changes address several critical issues from the previous review:

  • Fixed module interface: The daft/io/clickhouse/__init__.py file now properly imports and exposes the ClickHouseDataSink class through __all__
  • Corrected dependency configuration: The 'clickhouse' extra is now properly included in the 'all' extras list in pyproject.toml
  • Improved parameter handling: The ClickHouse data sink implementation properly merges connection parameters with user-provided client kwargs
  • Proper development setup: Added clickhouse-connect>=0.8.18 to requirements-dev.txt for development environment consistency

The implementation integrates seamlessly with Daft's existing data sink architecture, using the write/finalize pattern for distributed execution. Users can now write DataFrames to ClickHouse via df.write_clickhouse(host="localhost", port=8123, table="my_table") and receive aggregated statistics about the write operation. The data sink handles Arrow table conversion, manages ClickHouse client connections appropriately for distributed environments, and provides proper resource cleanup.
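A condensed sketch of the write phase described here, based on the excerpts in this review (per-call client creation because sockets cannot be serialized, Arrow insertion, cleanup); the to_arrow() conversion and the wrapping of results in WriteResult are assumptions:

```python
from clickhouse_connect import get_client

def write_partitions(micropartitions, client_kwargs, table):
    # A new client is created inside write because a socket cannot be
    # serialized and shipped to distributed workers (as noted in the PR).
    ch_client = get_client(**client_kwargs)
    try:
        for mp in micropartitions:
            # mp.to_arrow() is an assumption for "Arrow table conversion";
            # insert_arrow performs an Arrow-native insert and returns a
            # per-partition summary.
            summary = ch_client.insert_arrow(table, mp.to_arrow())
            yield summary  # the actual sink wraps this in WriteResult[QuerySummary]
    finally:
        ch_client.close()
```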

Confidence score: 4/5

  • This PR is now much safer to merge after addressing the critical issues from the previous review
  • The implementation follows established patterns and includes proper error handling and resource management
  • Files still needing attention: daft/io/clickhouse/clickhouse_data_sink.py for potential parameter override security considerations

5 files reviewed, 2 comments


user: Optional[str] = None,
password: Optional[str] = None,
database: Optional[str] = None,
table: Optional[str] = None,
Contributor


logic: The table parameter should be required since the ClickHouseDataSink constructor raises ValueError if table is None/empty

Comment on lines 121 to 122
#clickhouse
clickhouse-connect>=0.8.18
Contributor


style: Consider adding a comment explaining why version >=0.8.18 is required, similar to other pinned dependencies in this file.

password: str | None = None,
database: str | None = None,
table: str | None = None,
client_kwargs: dict[str, Any] | None = None,
Contributor


Aren't host/port already covered by client_kwargs? What is usually passed here?

Contributor Author


Other params, e.g. timeout, access_token.
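For example (a hypothetical call; connect_timeout is a clickhouse-connect client setting and access_token comes from the author's comment, neither is confirmed in this PR's signature):

```python
df.write_clickhouse(
    host="localhost",
    port=8123,
    table="my_table",
    # Extra settings are forwarded to clickhouse_connect.get_client rather
    # than being separate write_clickhouse parameters.
    client_kwargs={"connect_timeout": 10, "access_token": "<token>"},
)
```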

@huleilei
Contributor Author

@Jay-ju Can you help me review the code? Thanks.
