[Improve][connector-clickhouse] Clickhouse support parallelism reading schema #9446


Open: wants to merge 4 commits into base `dev`

Conversation

JeremyXin
Contributor

Purpose of this pull request

ClickHouse supports parallel schema reading.
Related PR: #9421

The ClickHouse source connector supports parallel reading of data. In table query mode, parallel reading is implemented based on the table's part files, which are obtained from the system.parts table.
The partition_list and filter_query parameters are used to filter data.
The batch_size parameter controls the amount of data read in each batch, to avoid OOM exceptions.
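A minimal source configuration exercising these options might look like the following sketch (host, credentials, and values are illustrative; the option names are those described in this PR):

```hocon
source {
  Clickhouse {
    host = "localhost:8123"
    database = "default"
    table = "t_user"
    username = "default"
    password = ""
    partition_list = ["2024-01", "2024-02"]  # only read parts from these partitions
    filter_query = "age > 18"                # row filter applied while reading
    batch_size = 1024                        # rows fetched per batch to avoid OOM
  }
}
```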

Does this PR introduce any user-facing change?

How was this patch tested?

Check list


JeremyXin added 2 commits June 16, 2025 20:18
@nielifeng nielifeng requested a review from Copilot June 17, 2025 01:56

@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds support for parallel schema reading in the ClickHouse connector by leveraging the table part files from the system.parts table. Key changes include:

  • New configuration options (e.g., partition_list, filter_query, batch_size) and test cases to support parallel reading.
  • Updates to the core proxy, splitter, enumerator, source reader, and associated state management for splitting and reading parts concurrently.
  • Documentation updates explaining the new parallel reader features.

Reviewed Changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
seatunnel-e2e/connector-clickhouse-e2e/.../clickhouse_with_parallelism_read.conf Added test configuration for parallel read demonstration
ClickhouseIT.java Added new test methods and constants to verify parallel reading functionality
TablePartSplitterTest.java Introduced tests for generating splits, including duplicate parts handling
ClickhouseValueReaderTest.java Added tests to validate various batch reading scenarios
ClickhouseProxy.java Implemented methods to retrieve part lists and query data per part
ClickhouseSourceState.java Updated state object to include pending splits
TablePartSplitter.java Created new splitting logic for ClickHouse parts
ClickhouseSourceSplitEnumerator.java Added new split enumerator to support parallel splits assignment
ClickhouseSourceSplit.java Defined a split abstraction based on ClickHouse parts
ClickhouseValueReader.java Modified value reader to iteratively process splits and update part offsets
ClickhouseSourceTable.java Updated source table configuration to include new options
ClickhouseSourceReader.java Refactored source reader to integrate parallelism mode with split queue management
ClickhouseSourceFactory.java Enhanced factory to build source tables and incorporate new parallelism parameters
ClickhouseSource.java Updated the connector interface to support parallel reading with new enumerator and reader
ClickhousePart.java Introduced Comparable interface implementation (stubbed in current diff)
ClickhouseTable.java Added getter for local database name
ClickhouseConnectorErrorCode.java Added new error codes for part retrieval and query issues
ClickhouseSourceOptions.java Defined new options: part_size, partition_list, batch_size, and filter_query
ClickhouseBaseOptions.java Added table option to support table name configuration
docs/en/connector-v2/source/Clickhouse.md Updated documentation with instructions and tips for parallel reading
Comments suppressed due to low confidence (1)

seatunnel-connectors-v2/connector-clickhouse/src/main/java/org/apache/seatunnel/connectors/seatunnel/clickhouse/source/ClickhousePart.java:77

  • The compareTo method always returns 0, which effectively treats all instances as equal. Consider implementing a proper comparison (for example, based on the part name) or removing Comparable if natural ordering is not intended.
public int compareTo(ClickhousePart o) { return 0; }
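One way to address this, assuming ordering by part name is acceptable (the `name` field is an assumption about the class, not confirmed by the diff):

```java
// Sketch: give ClickhousePart a natural ordering based on the part name,
// instead of a compareTo that always returns 0.
class ClickhousePart implements Comparable<ClickhousePart> {
    private final String name;

    ClickhousePart(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    @Override
    public int compareTo(ClickhousePart o) {
        // Compare by part name so sorted collections and sorting
        // utilities order parts deterministically.
        return this.name.compareTo(o.name);
    }
}
```

Alternatively, if no natural ordering is intended, dropping `Comparable` entirely avoids the misleading contract.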

"select name from system.parts where database = '%s' and table = '%s'",
database, table);

if (partitionList != null && !partitionList.isEmpty()) {

Copilot AI Jun 17, 2025


The SQL query in getPartList is built by directly concatenating the partition list values. Consider using a parameterized query or properly escaping input values to mitigate the risk of SQL injection.
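A minimal way to reduce this risk is to escape single quotes (and backslashes) in each value before interpolating it into the query. The sketch below is illustrative; the class and method names are hypothetical, not part of the PR:

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch: build the system.parts lookup with escaped string literals,
// so a partition value cannot break out of its quoted literal.
class PartQueryBuilder {

    // Escape backslashes first, then single quotes, per ClickHouse
    // string-literal escaping rules.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("'", "\\'");
    }

    static String buildPartQuery(String database, String table, List<String> partitionList) {
        String sql = String.format(
                "select name from system.parts where database = '%s' and table = '%s'",
                escape(database), escape(table));
        if (partitionList != null && !partitionList.isEmpty()) {
            String in = partitionList.stream()
                    .map(p -> "'" + escape(p) + "'")
                    .collect(Collectors.joining(", "));
            sql += " and partition in (" + in + ")";
        }
        return sql;
    }
}
```

A JDBC `PreparedStatement` with `?` placeholders would be stronger still, where the driver supports it for this query shape.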


| username | String | Yes | - | `ClickHouse` user username. |
| password | String | Yes | - | `ClickHouse` user password. |
| database | String | NO | - | The `ClickHouse` database. |
| table | String | NO | - | The `ClickHouse` table. If it is a distributed table, the cluster is obtained based on the table engine. If it is a local table, build the cluster based on the input `host` |
Member


Suggested change
| table | String | NO | - | The `ClickHouse` table. If it is a distributed table, the cluster is obtained based on the table engine. If it is a local table, build the cluster based on the input `host` |
| table_path | String | NO | - | The `ClickHouse` table. If it is a distributed table, the cluster is obtained based on the table engine. If it is a local table, build the cluster based on the input `host` |

Same as JDBC

Contributor Author


If the table_path parameter is used instead, should the database parameter also be removed, with both represented uniformly by table_path?

Member


Yes.

@@ -211,6 +213,17 @@ public void testClickHouseWithMultiTableSink(TestContainer container) throws Exc
}
}

@TestTemplate
public void testClickhouseWithParallelismRead(TestContainer testContainer)
Member


Could you add test cases to verify that filter_query and partition_list work properly?

Contributor Author


Ok. I will add more test cases.


String sql =
String.format(
"select * from %s.%s where %s limit %d, %d",
Member


Is this implemented this way because, for a single part, `limit m, n` can guarantee the order?

Contributor Author


This implementation reads each part in batches, to avoid loading large amounts of data during parallel reads. Each ClickhousePart object has an offset attribute recording how far into the current part reading has progressed, which keeps the batch reads ordered.
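Concretely, batched reading of one part amounts to a sequence of queries of this shape, with the recorded offset advancing by batch_size each time (the table name, part name, and `_part` predicate below are illustrative, not taken from the diff):

```sql
-- First batch for one part (offset 0, batch_size 1024)
select * from default.t_user where _part = 'all_1_1_0' limit 0, 1024;

-- Next batch: the reader advances this part's offset by batch_size
select * from default.t_user where _part = 'all_1_1_0' limit 1024, 1024;
```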

Member


[screenshot omitted] What I mean is: without an explicit sort, is it guaranteed that `limit m, n` won't return duplicate rows across batches?

Contributor Author


After reading the ClickHouse documentation, I found that ClickHouse supports a LIMIT ... WITH TIES form, which ensures that rows sharing the same value in the ORDER BY fields are returned in the same batch. Meanwhile, the table's ORDER BY fields can be used as the sorting key when querying a part. Can this solution solve the problem?
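For illustration, a query of the kind described might look like this (the table, part name, and sorting key are hypothetical; WITH TIES requires an ORDER BY clause):

```sql
-- Rows tied with the last row on the ORDER BY key are included in the
-- same batch, so equal-key rows are never split across two batches.
select * from default.t_user
where _part = 'all_1_1_0'
order by id
limit 1024 with ties;
```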

Member


I think it's good.

…er, add sql parallelism read strategy and fix other problem.
@JeremyXin
Contributor Author

I have made the following updates:

  1. Added new e2e test cases.
  2. Added the table_path parameter and made the corresponding configuration changes (including the e2e configuration).
  3. Added a sql parallelism read strategy: if the sql parameter is specified, parallel reading is implemented by executing local table-based queries on each shard of the cluster in parallel.
  4. Fixed the data duplication issue that could be caused by limit.
  5. Other newly added code and optimizations.

Thanks for helping with the review!
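Item 3 above suggests a configuration along these lines (a sketch only; the shard hosts and table names are illustrative, and the sql statement targets the local table so it can run on each shard in parallel):

```hocon
source {
  Clickhouse {
    host = "shard1:8123,shard2:8123"
    username = "default"
    password = ""
    sql = "select id, name from default.t_user_local where age > 18"
  }
}
```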

@@ -40,6 +40,13 @@ public class ClickhouseBaseOptions {
.noDefaultValue()
.withDescription("Clickhouse database name");

/** Clickhouse table name */
public static final Option<String> TABLE =
Options.key("table")
Member


Suggested change
Options.key("table")
Options.key("table_path")
