[FLINK-36931][cdc] FlinkCDC YAML supports batch mode #3812
Premise
MysqlCDC supports snapshot mode
MysqlCDC in Flink CDC (MySqlSource) supports StartupMode.SNAPSHOT; in that mode the source is of Boundedness.BOUNDED and can run in RuntimeExecutionMode.BATCH.
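For reference, a minimal DataStream sketch of this premise, assuming `StartupOptions.snapshot()` (the StartupMode.SNAPSHOT above) is available in the connector version in use; hostnames, credentials, and table names are placeholders:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.cdc.connectors.mysql.source.MySqlSource;
import org.apache.flink.cdc.connectors.mysql.table.StartupOptions;
import org.apache.flink.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SnapshotInBatchModeExample {
    public static void main(String[] args) throws Exception {
        // Snapshot-only startup mode: the source reads the full snapshot and then finishes,
        // i.e. it is Boundedness.BOUNDED, so the job can run under the BATCH runtime mode.
        MySqlSource<String> source = MySqlSource.<String>builder()
                .hostname("localhost")                      // placeholder
                .port(3306)
                .databaseList("app_db")                     // placeholder
                .tableList("app_db.orders")                 // placeholder
                .username("flink")                          // placeholder
                .password("secret")                         // placeholder
                .startupOptions(StartupOptions.snapshot())  // StartupMode.SNAPSHOT
                .deserializer(new JsonDebeziumDeserializationSchema())
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "MySQL Snapshot Source")
                .print();
        env.execute("mysql-snapshot-in-batch-mode");
    }
}
```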
Streaming vs. Batch
Streaming mode suits jobs such as: jobs with strict real-time requirements; in non-real-time scenarios, stateless jobs with many shuffle stages; jobs that need continuous, stable data-processing capacity; and jobs with small state, simple topologies, and low fault-tolerance costs.
Batch mode suits jobs such as: in non-real-time scenarios, jobs with many stateful operators; jobs that need high resource utilization; and jobs with large state, complex topologies, and high fault-tolerance costs.
Expectation
Full snapshot synchronization
The FlinkCDC YAML job reads only the full snapshot data of the database and writes it to the target database, in either Streaming or Batch mode. This is mainly used for full data catch-up.
Currently, a FlinkCDC YAML job with the SNAPSHOT startup strategy runs correctly in Streaming mode, but not in Batch mode.
Full-incremental offline
On its first run, the FlinkCDC YAML job collects the full snapshot data plus the incremental log data from the final offset of the incremental snapshot algorithm up to the current EndingOffset; on subsequent runs, it collects only the log data between the last EndingOffset and the current EndingOffset.
The job runs in Batch mode. Users can schedule it periodically (for example hourly or daily), tolerate data delays within that period, and still obtain eventual consistency. Because each scheduled incremental run collects only the logs between the last EndingOffset and the current EndingOffset, repeated full collection of the data is avoided.
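A minimal, hypothetical sketch of that scheduling loop: the offset file, the current-position lookup, and `submitPipeline` are illustrative stand-ins for the user's own platform, and the offset-range startup itself is the proposal discussed below, not an existing option.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical driver run by an external scheduler once per hour/day. */
public class OffsetRangeScheduleDriver {
    private static final Path OFFSET_FILE = Paths.get("/tmp/pipeline.ending-offset");

    public static void main(String[] args) throws Exception {
        // Starting offset = the EndingOffset persisted by the previous run
        // (empty on the very first run, which also reads the full snapshot).
        String startingOffset = Files.exists(OFFSET_FILE) ? Files.readString(OFFSET_FILE) : "";

        // Current EndingOffset, e.g. obtained from MySQL via SHOW MASTER STATUS.
        String endingOffset = fetchCurrentBinlogPosition();

        // Submit the Batch-mode pipeline bounded to [startingOffset, endingOffset]
        // (this relies on the offset-range StartupMode proposed in this issue).
        submitPipeline(startingOffset, endingOffset);

        // Persist the EndingOffset so the next scheduled run continues from here.
        Files.writeString(OFFSET_FILE, endingOffset);
    }

    private static String fetchCurrentBinlogPosition() {
        return "mysql-bin.000042/15213"; // placeholder value
    }

    private static void submitPipeline(String startingOffset, String endingOffset) {
        // placeholder: hand the offset range to the YAML pipeline submission
    }
}
```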
Test
Full snapshot synchronization in Batch mode
Solution
Use StartupMode.SNAPSHOT + Streaming for full snapshot synchronization
No source-code changes are needed. For MysqlCDC, once StartupMode.SNAPSHOT is specified, a full snapshot synchronization job for the whole database can already run in Streaming mode. It is not the optimal solution, but it provides the capability today.
Extend FlinkPipelineComposer to support full synchronization in Batch mode
Topology graph: Source -> PreTransform -> PostTransform -> Router -> PartitionBy -> Sink
There are no change events in Batch mode, so schema evolution does not need to be considered. In addition, automatic table creation completes before the job starts.
Field derivation for transforms can be done before the job starts instead of at runtime; other work, such as the derivation performed by the Router, can likewise be moved to before the job starts.
Workload: implement a Batch construction strategy in FlinkPipelineComposer. The Router needs to become an independent step, and the Sink needs to be extended or adapted so that it does not require a coordinator (ideally with true batch writing).
Extend StartupMode to let users specify an offset range, supporting incremental offline synchronization
Allow users to specify the binlog offset range to collect; the user's own platform then records the EndingOffset of each execution and handles the periodic scheduling.
Discussion
1. Is it necessary to support Batch mode at all, given that the benefits may be small or the performance may not beat Streaming? Specifically, which Batch optimizations can actually be leveraged?
2. Should the full-incremental offline approach be implemented (letting users periodically schedule incremental log synchronization)?
Code implementation
Topology graph: Source -> PreTransform -> PostTransform -> SchemaBatchOperator -> PartitionBy(Batch) -> BatchSink
Note: the data flow contains only CreateTableEvent, CreateTableCompletedEvent, and DataChangeEvent (insert).
Implementation ideas
1. The Source first emits all CreateTableEvents, then appends a CreateTableCompletedEvent, and finally emits the snapshot data.
2. PreTransform and PostTransform forward the CreateTableCompletedEvent as-is; their behavior is otherwise unchanged.
3. When SchemaBatchOperator receives a CreateTableEvent, it only caches it and emits nothing.
4. When SchemaBatchOperator receives the CreateTableCompletedEvent, it derives the widest downstream table schema from the route rules, executes the table creation statements against the external data source, and then emits the widened schemas to BatchPrePartition (see the sketch after this list).
5. BatchPrePartition broadcasts CreateTableEvents to PostPartition and partitions DataChangeEvents across PostPartition instances based on table ID and primary-key information (also sketched below).
6. PostPartition emits the CreateTableEvents and DataChangeEvents to BatchSink, which performs batch writing.
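To make steps 3 and 4 concrete, here is an illustrative sketch of the SchemaBatchOperator idea; it is not this PR's actual code, and the schema-merging and table-creation helpers are placeholders (CreateTableCompletedEvent is the new event type this change introduces):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.cdc.common.event.CreateTableEvent;
import org.apache.flink.cdc.common.event.Event;
import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

/** Illustrative sketch only; the real SchemaBatchOperator in this PR may differ. */
public class SchemaBatchOperatorSketch extends AbstractStreamOperator<Event>
        implements OneInputStreamOperator<Event, Event> {

    // Step 3: CreateTableEvents are cached until the source signals that all of them were sent.
    private final List<CreateTableEvent> cachedCreateTableEvents = new ArrayList<>();

    @Override
    public void processElement(StreamRecord<Event> record) throws Exception {
        Event event = record.getValue();
        if (event instanceof CreateTableEvent) {
            // Step 3: cache only, emit nothing yet.
            cachedCreateTableEvents.add((CreateTableEvent) event);
        } else if (isCreateTableCompleted(event)) {
            // Step 4: derive the widest downstream schema per route rule, create the tables
            // in the external data source, then emit the widened CreateTableEvents downstream.
            for (CreateTableEvent widened : mergeByRouteRules(cachedCreateTableEvents)) {
                createTableInExternalDataSource(widened);
                output.collect(new StreamRecord<>(widened));
            }
        } else {
            // DataChangeEvent (insert) passes through untouched.
            output.collect(record);
        }
    }

    private boolean isCreateTableCompleted(Event event) {
        // Placeholder: the real operator would check `event instanceof CreateTableCompletedEvent`.
        return "CreateTableCompletedEvent".equals(event.getClass().getSimpleName());
    }

    private List<CreateTableEvent> mergeByRouteRules(List<CreateTableEvent> tables) {
        return tables; // placeholder: real logic widens schemas according to the route rules
    }

    private void createTableInExternalDataSource(CreateTableEvent table) {
        // placeholder: real logic issues the CREATE TABLE against the sink's external system
    }
}
```

And a similarly hedged sketch of the keying behind step 5, assuming the partition key is derived from the table ID plus the primary-key part of the record (the primary-key extraction helper is a placeholder; the PR may implement partitioning differently):

```java
import java.util.Objects;

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.cdc.common.event.DataChangeEvent;

/** Illustrative only: rows of the same table and primary key land in the same partition. */
public class TableAndPrimaryKeySelector implements KeySelector<DataChangeEvent, Integer> {
    @Override
    public Integer getKey(DataChangeEvent event) {
        return Objects.hash(event.tableId(), extractPrimaryKey(event));
    }

    private Object extractPrimaryKey(DataChangeEvent event) {
        return event.after(); // placeholder: real logic hashes only the primary-key fields
    }
}
```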
Implementation effect
Computing node 1: Source -> PreTransform -> PostTransform -> SchemaBatchOperator -> BatchPrePartition
Computing node 2: PostPartition -> BatchSink
Batch mode: Computing node 2 starts computing only after computing node 1 is completely finished.