[Feature] Optimize CDC sync database action to avoid blocking on listTables operation with large number of tables #5955

@huyuanfeng2018

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

When the CDC sync database action runs against a database containing a very large number of tables (hundreds of thousands), the catalog.listTables(database) call in SyncDatabaseActionBase can take a long time to complete, and the entire synchronization job is blocked at startup until it returns. This significantly increases the time it takes to start a CDC synchronization job.

Solution

Current Behavior

The current implementation calls catalog.listTables(database) during initialization and maintains a createdTables set to track table creation status. This approach:

  1. Blocks the entire sync process while listing all tables
  2. Consumes unnecessary memory to maintain the createdTables set
  3. Performs redundant operations when tables are created lazily
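
To make the cost of this pattern concrete, here is a minimal, self-contained sketch of the eager approach. It is not the actual Paimon code: the TableCatalog interface, the initCreatedTables helper, and the table count are hypothetical stand-ins chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class EagerListingSketch {

    /** Hypothetical stand-in for a catalog; not the real Paimon interface. */
    interface TableCatalog {
        List<String> listTables(String database);
    }

    /** Mirrors the pattern described above: list everything first, keep it in a set. */
    static Set<String> initCreatedTables(TableCatalog catalog, String database) {
        // Blocks until the catalog has returned every table name in the database.
        return new HashSet<>(catalog.listTables(database));
    }

    public static void main(String[] args) {
        int tableCount = 200_000; // illustrative size only (assumption)
        TableCatalog catalog = db -> {
            List<String> names = new ArrayList<>(tableCount);
            for (int i = 0; i < tableCount; i++) {
                names.add("t_" + i);
            }
            return names;
        };

        Set<String> createdTables = initCreatedTables(catalog, "source_db");
        // The set grows with the database, even if only a few tables ever receive records.
        System.out.println("tracked tables: " + createdTables.size());
    }
}
```

In this sketch both the startup delay and the memory held by the set scale with the total number of tables in the database, not with the number of tables that actually receive CDC records.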

Expected Behavior

The sync process should:

  1. Avoid blocking on the listTables operation during initialization
  2. Create tables lazily when needed without maintaining a global createdTables set
  3. Improve overall performance for databases with large numbers of tables

Solution

Optimize the table creation logic by:

  1. Removing the upfront listTables call in SyncDatabaseActionBase
  2. Eliminating the createdTables set from RichCdcMultiplexRecordEventParser
  3. Implementing lazy table creation in CdcDynamicTableParsingProcessFunction with existence checks
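
The lazy creation idea in the steps above could look roughly like the following sketch. It is an illustration under stated assumptions, not the proposed patch: TableCatalog, InMemoryCatalog, and ensureTable are hypothetical names, and the real Paimon Catalog and CdcDynamicTableParsingProcessFunction have different signatures.

```java
import java.util.HashSet;
import java.util.Set;

public class LazyTableCreationSketch {

    /** Hypothetical stand-in for a catalog; not the real Paimon interface. */
    interface TableCatalog {
        boolean tableExists(String database, String table);

        void createTable(String database, String table);
    }

    /** In-memory catalog that counts calls so the behavior can be observed. */
    static class InMemoryCatalog implements TableCatalog {
        final Set<String> tables = new HashSet<>();
        int existsCalls = 0;
        int createCalls = 0;

        @Override
        public boolean tableExists(String database, String table) {
            existsCalls++;
            return tables.contains(database + "." + table);
        }

        @Override
        public void createTable(String database, String table) {
            createCalls++;
            tables.add(database + "." + table);
        }
    }

    /**
     * Called when a record for the given table arrives.
     * Returns true if the table had to be created, false if it already existed.
     */
    static boolean ensureTable(TableCatalog catalog, String database, String table) {
        if (catalog.tableExists(database, table)) {
            return false;
        }
        catalog.createTable(database, table);
        return true;
    }

    public static void main(String[] args) {
        InMemoryCatalog catalog = new InMemoryCatalog();
        catalog.tables.add("db.orders"); // pretend this table already exists

        // No upfront listing: each table is checked only when a record for it shows up.
        System.out.println(ensureTable(catalog, "db", "orders")); // existing table
        System.out.println(ensureTable(catalog, "db", "users"));  // new table
        System.out.println(ensureTable(catalog, "db", "users"));  // now exists
        System.out.println("create calls: " + catalog.createCalls);
    }
}
```

The trade-off in this sketch is that startup no longer depends on the size of the database, at the price of one existence check per table lookup. A real operator would likely need to bound how often that check runs, for example by performing it only when a table is first seen by a given subtask; that detail is an assumption and is not specified in this issue.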

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Metadata

Labels

enhancement (New feature or request)