zjjiang created FLINK-36701:
-------------------------------

             Summary: Pipeline fails over again after handling a schema change 
event as the first event after a failover
                 Key: FLINK-36701
                 URL: https://issues.apache.org/jira/browse/FLINK-36701
             Project: Flink
          Issue Type: Bug
          Components: Flink CDC
    Affects Versions: cdc-3.2.0, cdc-3.3.0
            Reporter: zjjiang
         Attachments: image-2024-11-13-16-29-30-065.png

Currently, when the first events the pipeline handles directly after a failover 
are a schema change event (e.g. an AddColumnEvent) followed by a 
DataChangeEvent, the job may fail again because the sink has already applied 
that schema change and ends up applying it a second time.

The cause of the problem can be explained as follows:
1. The SinkWriterOperator currently requests the latest schema when it receives 
the first non-CreateTableEvent schema change event (assuming there is no schema 
in its local cache).
2. The schema manager applies the schema change after confirming that the flush 
succeeded.
3. Assume that the sequence after the failover is a schema change event first, 
followed by a data change event.
On the schema manager side, the schema manager applies the schema change event 
to its cached schema (i.e. evolvedSchema) after confirming a successful flush.
On the SinkWriterOperator side, the processing flow is:
1) Handle the FlushEvent;
2) Handle the schema change event (in this step, the latest schema is fetched 
from the schema manager and sent downstream, and then the schema change event 
itself is emitted) -- note that this step does not report an error;
3) Handle the data change event -- this is where the failover occurs, because 
the data record's column count does not match the schema (see the sketch below).

!image-2024-11-13-16-29-30-065.png!

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
