[ https://issues.apache.org/jira/browse/FLINK-20654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Piotr Nowojski closed FLINK-20654.
----------------------------------
    Release Note: Using UnalignedCheckpoints in Flink 1.12.0 combined with two/multiple input tasks, or with union inputs for single input tasks, can result in corrupted state. This can happen if a new checkpoint is triggered before recovery is fully completed. For state to be corrupted, a task with two or more input gates must receive a checkpoint barrier at exactly the same time as this task finishes recovering spilled in-flight data. In such a case the new checkpoint can succeed with corrupted/missing in-flight data, which will result in various deserialisation/corrupted data stream errors when someone attempts to recover from such a corrupted checkpoint.
      Resolution: Fixed

merged commit 45f8577 into apache:release-1.12
merged commit c6786ab into apache:master

> Unaligned checkpoint recovery may lead to corrupted data stream
> ---------------------------------------------------------------
>
>                 Key: FLINK-20654
>                 URL: https://issues.apache.org/jira/browse/FLINK-20654
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.12.0
>            Reporter: Arvid Heise
>            Assignee: Roman Khachatryan
>            Priority: Blocker
>              Labels: pull-request-available, test-stability
>             Fix For: 1.13.0, 1.12.1
>
>
> Fix of FLINK-20433 shows potential corruption after recovery for all
> variations of UnalignedCheckpointITCase.
> To reproduce, run UCITCase a couple hundred times. The issue showed for me
> in:
> - execute [Parallel union, p = 5]
> - execute [Parallel union, p = 10]
> - execute [Parallel cogroup, p = 5]
> - execute [parallel pipeline with remote channels, p = 5]
> with decreasing frequency.
> The issue manifests as one of the following issues:
> - stream corrupted exception
> - EOF exception
> - assertion failure in NUM_LOST or NUM_OUT_OF_ORDER
> - (for union) ArithmeticException overflow (because the number that should be
> [0;100000] has been mis-deserialized)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)