
[Feature][Connector-V2] ClickHouse source support parallelism #9421


Open · wants to merge 4 commits into base: dev

Conversation

mrtisttt
Contributor

@mrtisttt mrtisttt commented Jun 10, 2025

Purpose of this pull request

closed #9338

Does this PR introduce any user-facing change?

Yes, it provides a new feature that enables parallel reading support for ClickHouse.

How was this patch tested?

Add connector unit tests and e2e tests.

Check list


mrtisttt added 3 commits June 11, 2025 03:23
@hailin0 hailin0 requested a review from Copilot June 11, 2025 01:44
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds support for parallel reading to the ClickHouse V2 connector, enabling users to shard source queries and improve read throughput.

  • Introduces split enumeration and reader logic to dispatch multiple query “splits” to parallel subtasks
  • Adds unit tests (ClickhouseChunkSplitterTest) and E2E scenarios (ClickhouseIT) covering numeric, date/time, and string partitioning
  • Updates connector code (ClickhouseSourceConfig, factory, reader, enumerator) and configuration docs to expose partition_column, partition_num, and optional bounds
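
For orientation, here is a minimal sketch of how such options are typically declared with SeaTunnel's Options API. The option names come from this PR's description, but the type name, exact value types, defaults, and descriptions below are assumptions rather than the PR's actual ClickhouseSourceOptions code.

    import org.apache.seatunnel.api.configuration.Option;
    import org.apache.seatunnel.api.configuration.Options;

    // Hypothetical sketch of the new parallel-read options; not the PR's exact code.
    public interface ClickhouseParallelReadOptionsSketch {

        // Column used to shard the source query into parallel splits.
        Option<String> PARTITION_COLUMN =
                Options.key("partition_column")
                        .stringType()
                        .noDefaultValue()
                        .withDescription("Column used to partition the source query.");

        // Number of splits to generate for parallel reading.
        Option<Integer> PARTITION_NUM =
                Options.key("partition_num")
                        .intType()
                        .noDefaultValue()
                        .withDescription("Number of partitions (splits) to read in parallel.");

        // Optional bounds; if omitted, the connector could derive MIN/MAX itself.
        // Long is an assumption here; the real option may use a different type.
        Option<Long> PARTITION_LOWER_BOUND =
                Options.key("partition_lower_bound")
                        .longType()
                        .noDefaultValue()
                        .withDescription("Lower bound of the partition column.");

        Option<Long> PARTITION_UPPER_BOUND =
                Options.key("partition_upper_bound")
                        .longType()
                        .noDefaultValue()
                        .withDescription("Upper bound of the partition column.");
    }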

Reviewed Changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 3 comments.

Summary per file:

  • docs/zh/connector-v2/source/Clickhouse.md, docs/en/connector-v2/source/Clickhouse.md: Documented the new partition_column, partition_num, partition_lower_bound, and partition_upper_bound options
  • seatunnel-e2e/.../connector-clickhouse-e2e/.../parallel_read/*.conf: Added example configs for string, numeric, date, datetime, and single-shard batch tests
  • seatunnel-e2e/.../connector-clickhouse-e2e/.../ClickhouseIT.java: Extended integration tests to validate parallel reads
  • seatunnel-connectors-v2/.../ClickhouseChunkSplitterTest.java: New unit tests for the split-generation logic
  • seatunnel-connectors-v2/connector-clickhouse/src/main/java/.../ClickhouseSourceConfig.java: Added a builder-backed config object for the parallel-read options
  • seatunnel-connectors-v2/connector-clickhouse/src/main/java/.../ClickhouseSourceOptions.java: Defined the new connector options, including the partitioning keys
  • seatunnel-connectors-v2/connector-clickhouse/src/main/java/.../ClickhouseSourceFactory.java: Hooked up the new ClickhouseSourceConfig in the factory
  • seatunnel-connectors-v2/connector-clickhouse/src/main/java/.../ClickhouseSource.java: Implemented the SeaTunnelSource and SupportParallelism interfaces
  • seatunnel-connectors-v2/connector-clickhouse/src/main/java/.../ClickhouseSourceSplitEnumerator.java: Added logic to generate and assign splits to readers
  • seatunnel-connectors-v2/connector-clickhouse/src/main/java/.../ClickhouseSourceReader.java: Replaced the single-split reader with a split-aware SourceReader implementation
  • seatunnel-connectors-v2/connector-clickhouse/src/main/java/.../ClickHouseSourceSplit.java: Introduced the SourceSplit data holder
  • seatunnel-connectors-v2/connector-clickhouse/src/main/java/.../ClickhouseNumericBetweenParametersProvider.java: Helper for numeric/date range parameter generation
  • seatunnel-connectors-v2/connector-clickhouse/src/main/java/.../ClickhouseChunkSplitter.java: Core split-generation logic handling numeric, date/time, and string columns
  • seatunnel-connectors-v2/connector-clickhouse/src/main/java/.../ClickhouseSourceState.java: Added serialVersionUID for the state snapshot
Comments suppressed due to low confidence (2)

seatunnel-e2e/seatunnel-connector-v2-e2e/connector-clickhouse-e2e/src/test/java/org/apache/seatunnel/connectors/seatunnel/clickhouse/ClickhouseIT.java:238

  • The test labeled testClickHouseWithParallelReadDateTimeCol is executing the date-based config instead of the datetime config. Update the path to clickhouse_to_clickhouse_with_parallel_read_datetime.conf.
container.executeJob("/parallel_read/clickhouse_to_clickhouse_with_parallel_read_date.conf");

seatunnel-connectors-v2/connector-clickhouse/src/main/java/org/apache/seatunnel/connectors/seatunnel/clickhouse/source/ClickhouseSourceReader.java:46

  • The Context type is neither imported nor fully qualified. Either qualify it as SourceReader.Context or add import org.apache.seatunnel.api.source.SourceReader.Context; to resolve the compilation error.
private final Context context;

}

private static int getSplitOwner(String splitId, int numReaders) {
return splitId.hashCode() % numReaders;

Copilot AI Jun 11, 2025


Using hashCode() % numReaders can yield negative indices when hashCode() is negative. Consider using Math.floorMod(splitId.hashCode(), numReaders) to ensure non-negative reader assignment.

Suggested change
return splitId.hashCode() % numReaders;
return Math.floorMod(splitId.hashCode(), numReaders);
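
A standalone illustration of why the plain modulo is risky, using an arbitrary negative value as a stand-in for a negative hashCode():

    // Java's % keeps the sign of the dividend, so a negative hashCode()
    // yields a negative (invalid) reader index; Math.floorMod does not.
    public class SplitOwnerDemo {
        public static void main(String[] args) {
            int hash = -7; // stand-in for a negative splitId.hashCode()
            int numReaders = 4;
            System.out.println(hash % numReaders);               // prints -3
            System.out.println(Math.floorMod(hash, numReaders)); // prints 1
        }
    }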


public void close() throws IOException {}

@Override
public void addSplitsBack(List<ClickHouseSourceSplit> splits, int subtaskId) {}

Copilot AI Jun 11, 2025


The addSplitsBack method is left empty, so splits returned on failure won’t be reassigned. Implement logic to re-add these splits to pendingSplits to support recovery.

Suggested change
    public void addSplitsBack(List<ClickHouseSourceSplit> splits, int subtaskId) {}
    public void addSplitsBack(List<ClickHouseSourceSplit> splits, int subtaskId) {
        synchronized (stateLock) {
            log.info("Adding splits back for subtask {}: {}", subtaskId, splits);
            pendingSplits.computeIfAbsent(subtaskId, k -> new ArrayList<>()).addAll(splits);
        }
    }



@Override
public int currentUnassignedSplitSize() {
return 0;

Copilot AI Jun 11, 2025


currentUnassignedSplitSize always returns 0; consider returning the total count of unassigned splits (e.g., sum of sizes in pendingSplits) to accurately report backlog.

Suggested change
    return 0;
    synchronized (stateLock) {
        return pendingSplits.values().stream()
                .mapToInt(List::size)
                .sum();
    }


Member

@Hisoka-X Hisoka-X left a comment


I notice you use the chunk splitter of JDBC. It's not a bad choice, but using the DynamicChunkSplitter may be a better way. Then the connector can split partitions by itself without user config.

@mrtisttt
Contributor Author

mrtisttt commented Jun 11, 2025

I notice you use the chunk splitter of JDBC. It's not a bad choice, but using the DynamicChunkSplitter may be a better way. Then the connector can split partitions by itself without user config.

Hi @Hisoka-X, thank you sincerely for your review.

Indeed, the ClickHouse splitter follows the implementation of JDBC's FixedChunkSplitter, with adaptations, optimizations, and fixes tailored for ClickHouse, which makes it more stable and reliable.

I did notice the DynamicChunkSplitter and left room for extension when implementing the FixedChunkSplitter. However, since ClickHouse previously had no basic parallel-reading support at all, and implementing both the FixedChunkSplitter and the DynamicChunkSplitter at once would involve significant development and testing effort, I did not implement both at the same time.

The FixedChunkSplitter is stable and gives users a certain degree of flexibility. While keeping the FixedChunkSplitter, I will submit a follow-up PR to support the DynamicChunkSplitter. How about this?
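
For context on what the fixed-chunk approach does, here is a self-contained sketch of the underlying idea: carve the user-supplied bound range into partition_num contiguous BETWEEN ranges. This is an illustration of the technique, not the actual ClickhouseChunkSplitter or ClickhouseNumericBetweenParametersProvider code.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of fixed-chunk splitting over a numeric partition column.
    public class FixedChunkSketch {

        // Divide [lowerBound, upperBound] into partitionNum contiguous ranges.
        // Each {start, end} pair maps to "WHERE <partition_column> BETWEEN start AND end".
        static List<long[]> split(long lowerBound, long upperBound, int partitionNum) {
            long total = upperBound - lowerBound + 1;
            long chunk = total / partitionNum;
            long remainder = total % partitionNum;
            List<long[]> ranges = new ArrayList<>();
            long start = lowerBound;
            for (int i = 0; i < partitionNum && start <= upperBound; i++) {
                long end = start + chunk + (i < remainder ? 1 : 0) - 1;
                ranges.add(new long[] {start, Math.min(end, upperBound)});
                start = end + 1;
            }
            return ranges;
        }

        public static void main(String[] args) {
            // e.g. ids 1..100 read by 4 parallel splits -> 1-25, 26-50, 51-75, 76-100
            for (long[] r : split(1, 100, 4)) {
                System.out.printf("BETWEEN %d AND %d%n", r[0], r[1]);
            }
        }
    }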

@mrtisttt mrtisttt requested a review from Hisoka-X June 11, 2025 08:51
@mrtisttt
Contributor Author

@Hisoka-X Hi, would it be appropriate for me to submit a PR to support the DynamicChunkSplitter in the near future?

@Hisoka-X
Member

Yes, please use the DynamicChunkSplitter to reduce unnecessary configuration. In fact, the FixedChunkSplitter is legacy in JDBC and is not recommended.

@mrtisttt
Contributor Author

@Hisoka-X Since the ClickHouse source connector still uses a SQL query to read source tables and there are still many rough edges, in the subsequent implementation I will also implement:

  • query_table (just like JDBC or Doris, using a table_path option or similar)
  • where_condition (just like JDBC or Doris, using a filter.query option or similar)
  • read_field (just like Doris, using a read_field option or similar)

to better support the DynamicChunkSplitter. This also prepares for the subsequent implementation of multi-table reading. Perhaps in the future we can even remove the option of using a SQL query to read source tables.

How about this idea?

@Hisoka-X
Member

query_table (just like JDBC or Doris, using a table_path option or similar)

+1 for this; this way we can better identify the table's metadata and thereby improve read performance.

where_condition (just like JDBC or Doris, using a filter.query option or similar)
read_field (just like Doris, using a read_field option or similar)

These are not necessary at present; the filter conditions and field selection can already be expressed in the query.

Moreover, the query can preprocess data in ways the other options cannot, such as joins.

So let's support query with the DynamicChunkSplitter first, then support query_table.

@mrtisttt
Contributor Author

@Hisoka-X I have considered this model, but there is a problem: does the approach of query_table + query mean that the table name written in the user's query must be exactly the table in query_table? In this case, performing validation would be very difficult.

@JeremyXin
Contributor

@Hisoka-X Hi, I have thought of a reading method. For the query_table case, all the parts are obtained from the system table (system.parts), and the part data is then read directly from the table, similar to how Doris reads tablets. For filtering, it can be done through the partition list. This approach is similar to the Doris implementation and does not require writing SQL to read data tables concurrently.

Is this a feasible plan? I have already implemented the functionality described above and tested it in practical applications.
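
A rough JDBC sketch of the part-enumeration idea described above; the connection URL, table names, and the use of the MergeTree virtual column _part for per-part reads are illustrative assumptions, not the code referred to in this comment.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch: enumerate the active data parts of a MergeTree table via system.parts.
    public class PartEnumerationSketch {

        static List<String> listActiveParts(Connection conn, String database, String table) throws Exception {
            String sql = "SELECT name FROM system.parts WHERE database = ? AND table = ? AND active = 1";
            List<String> parts = new ArrayList<>();
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, database);
                ps.setString(2, table);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        parts.add(rs.getString("name"));
                    }
                }
            }
            return parts;
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical connection details for illustration only.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:clickhouse://localhost:8123/default", "default", "")) {
                for (String part : listActiveParts(conn, "default", "my_table")) {
                    // One split per part, e.g.:
                    // SELECT ... FROM default.my_table WHERE _part = '<part>'
                    System.out.println(part);
                }
            }
        }
    }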

@Carl-Zhou-CN
Member

@Hisoka-X Hi, I have thought of a reading method. For the query_table case, all the parts are obtained from the system table (system.parts), and the part data is then read directly from the table, similar to how Doris reads tablets. For filtering, it can be done through the partition list. This approach is similar to the Doris implementation and does not require writing SQL to read data tables concurrently.

Is this a feasible plan? I have already implemented the functionality described above and tested it in practical applications.

Good idea. Does it support distributed tables?

@Hisoka-X
Member

For the query_table case, all the parts are obtained from the system table (system.parts), and the part data is then read directly from the table, similar to how Doris reads tablets. For filtering, it can be done through the partition list.

@JeremyXin Yes, it's a good way for query_table.

I have already implemented the functionality described above and tested it in practical applications.

Welcome to contribute!

I have considered this model, but there is a problem: does the approach of query_table + query mean that the table name written in the user's query must be exactly the table in query_table?

@mrtisttt We have a processing priority: if query and table_path are configured at the same time, we get the data through query and the meta information through table_path. Please refer to:

    if (StringUtils.isNotEmpty(tableConfig.getTablePath())
            && StringUtils.isNotEmpty(tableConfig.getQuery())) {
        TablePath tablePath = jdbcDialect.parse(tableConfig.getTablePath());
        CatalogTable tableOfPath = null;
        try {
            tableOfPath = jdbcCatalog.getTable(tablePath);
        } catch (Exception e) {
            // ignore
            log.debug("User-defined table path: {}", tablePath);
        }
        CatalogTable tableOfQuery = jdbcCatalog.getTable(tableConfig.getQuery());
        if (tableOfPath == null) {
            String catalogName =
                    tableOfQuery.getTableId() == null
                            ? DEFAULT_CATALOG_NAME
                            : tableOfQuery.getTableId().getCatalogName();
            TableIdentifier tableIdentifier =
                    TableIdentifier.of(
                            catalogName,
                            tablePath.getDatabaseName(),
                            tablePath.getSchemaName(),
                            tablePath.getTableName());
            return CatalogTable.of(tableIdentifier, tableOfQuery);
        }
        return mergeCatalogTable(tableOfPath, tableOfQuery);
    }

@JeremyXin
Contributor

@Hisoka-X Hi, I have thought of a reading method. For the query_table case, all the parts are obtained from the system table (system.parts), and the part data is then read directly from the table, similar to how Doris reads tablets. For filtering, it can be done through the partition list. This approach is similar to the Doris implementation and does not require writing SQL to read data tables concurrently.
Is this a feasible plan? I have already implemented the functionality described above and tested it in practical applications.

Good idea. Does it support distributed tables?

Yes, for distributed tables, reading is actually still performed on the local tables. The parts under each shard are obtained and read concurrently.
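
A small sketch of that shard fan-out, assuming the cluster name and the underlying local table are already known (an illustration of the idea only, not the implementation mentioned above):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Sketch: list the shard replicas of a cluster from system.clusters; each
    // shard's local table would then be part-enumerated and read concurrently.
    public class ShardFanOutSketch {

        static void printShards(Connection conn, String cluster) throws Exception {
            String sql = "SELECT shard_num, host_name, port FROM system.clusters WHERE cluster = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, cluster);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.printf("shard %d -> %s:%d%n",
                                rs.getInt("shard_num"), rs.getString("host_name"), rs.getInt("port"));
                    }
                }
            }
        }
    }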

@mrtisttt
Contributor Author

@Hisoka-X Hi, I have thought of a reading method. For the query_table case, all the parts are obtained from the system table (system.parts), and the part data is then read directly from the table, similar to how Doris reads tablets. For filtering, it can be done through the partition list. This approach is similar to the Doris implementation and does not require writing SQL to read data tables concurrently.
Is this a feasible plan? I have already implemented the functionality described above and tested it in practical applications.

Good idea. Does it support distributed tables?

Yes, I've also thought about this issue. In ClickHouse, distributed tables pose a significant challenge, which is exactly why I mentioned the strong dependency on query_table.

@mrtisttt
Contributor Author

@Hisoka-X @Carl-Zhou-CN @JeremyXin So, how about we first implement the query_table approach? Or should we first implement the query-based DynamicChunkSplitter following the ClickHouse source connector's pattern?

@Hisoka-X
Member

@Hisoka-X @Carl-Zhou-CN @JeremyXin So, how about we first implement the query_table approach? Or should we first implement the query-based DynamicChunkSplitter following the ClickHouse source connector's pattern?

+1. Overall, table_path (let's align its name with the other connectors) is more important.

@JeremyXin
Contributor

@Hisoka-X @mrtisttt I will submit the preliminary code as soon as possible so we can see whether it is a suitable reading scheme. @mrtisttt
Or maybe you could continue developing based on the DynamicChunkSplitter? If there are any questions, feel free to discuss them together.

@Carl-Zhou-CN
Member

@Hisoka-X @mrtisttt I will submit the preliminary code as soon as possible so we can see whether it is a suitable reading scheme. @mrtisttt Or maybe you could continue developing based on the DynamicChunkSplitter? If there are any questions, feel free to discuss them together.

Looking forward to your submission. I think the two approaches are not in conflict, and users can have more choices.

@mrtisttt
Contributor Author

@Hisoka-X @mrtisttt I will submit the preliminary code as soon as possible so we can see whether it is a suitable reading scheme. @mrtisttt Or maybe you could continue developing based on the DynamicChunkSplitter? If there are any questions, feel free to discuss them together.

Looking forward to your submission. I think the two approaches are not in conflict, and users can have more choices.

Okay, I understand. Then I'll submit the implementation of the DynamicChunkSplitter first. This will indeed give users an additional option.

@Hisoka-X
Member

@Hisoka-X In this case, do we still need to implement the query-based DynamicChunkSplitter?

Yes please; query can cover many more scenarios than table_path. We can also use table_path's ability to read metadata to optimize split generation for query-based reads, which is also done in JDBC.

@Carl-Zhou-CN
Member

@Hisoka-X In this case, do we still need to implement the query-based DynamicChunkSplitter?

Yes please; query can cover many more scenarios than table_path. We can also use table_path's ability to read metadata to optimize split generation for query-based reads, which is also done in JDBC.

+1: in some scenarios, the SQL can only be executed on distributed tables and cannot be done on local tables.

@mrtisttt
Contributor Author

@Hisoka-X In this case, do we still need to implement the query-based DynamicChunkSplitter?

Yes please; query can cover many more scenarios than table_path. We can also use table_path's ability to read metadata to optimize split generation for query-based reads, which is also done in JDBC.

+1: in some scenarios, the SQL can only be executed on distributed tables and cannot be done on local tables.

OK, that makes the thinking very clear.
