[ https://issues.apache.org/jira/browse/FLINK-17869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-17869:
-----------------------------------
    Labels: pull-request-available  (was: )

> Fix the race condition of aborting unaligned checkpoint
> -------------------------------------------------------
>
>                 Key: FLINK-17869
>                 URL: https://issues.apache.org/jira/browse/FLINK-17869
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>            Reporter: Zhijiang
>            Assignee: Roman Khachatryan
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.11.0
>
>
> On the ChannelStateWriter side, the lifecycle of a checkpoint should be as
> follows: start -> in progress/abort -> stop.
> The ChannelStateWriteResult is created during #start and removed by either
> #abort or #stop. This leaves room for a race condition:
> * #start is called by the netty thread when it receives the first barrier,
> and the checkpoint is then scheduled for execution.
> * The task thread might process a checkpoint cancellation and call #abort
> before it performs that checkpoint.
> * The checkpoint can still be executed by the task thread afterwards, even
> though the abort happened first, because we cannot remove the checkpoint
> action from the mailbox while aborting.
> * While the checkpoint executes, it calls
> `ChannelStateWriter#getWriteResult`, which throws an
> `IllegalStateException` because the respective result was already removed
> while handling #abort.
> * As a result, the task fails unnecessarily while performing the
> checkpoint.
> I guess we do not want to fail the task when a checkpoint is aborted by
> design. The illegal state check in ChannelStateWriter#getWriteResult was, I
> guess, mainly intended to validate the normal process.
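
The interleaving described above can be replayed in a single thread with a toy
model. This is only a sketch under stated assumptions: ToyWriter, the mailbox
queue and replay() are hypothetical stand-ins invented for illustration, not
Flink classes, and the real code spreads these calls across the netty thread
and the task thread.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model of the race; every name here is a hypothetical stand-in, not Flink code. */
public class AbortRaceSketch {

    /** Stand-in for ChannelStateWriter: keeps one result object per checkpoint id. */
    public static class ToyWriter {
        private final Map<Long, Object> results = new HashMap<>();

        public void start(long checkpointId) {
            results.put(checkpointId, new Object());
        }

        public void abort(long checkpointId) {
            results.remove(checkpointId);
        }

        public Object getWriteResult(long checkpointId) {
            Object result = results.get(checkpointId);
            if (result == null) {
                throw new IllegalStateException(
                        "no write result for checkpoint " + checkpointId);
            }
            return result;
        }
    }

    /** Replays the interleaving; returns the failure hit by the checkpoint action, or null. */
    public static RuntimeException replay() {
        ToyWriter writer = new ToyWriter();
        Queue<Runnable> mailbox = new ArrayDeque<>();
        long checkpointId = 1L;

        // "netty thread": first barrier arrives, result is created, checkpoint is scheduled.
        writer.start(checkpointId);
        mailbox.add(() -> writer.getWriteResult(checkpointId));

        // "task thread": the cancellation is processed first and removes the result.
        // The already scheduled action stays in the mailbox.
        writer.abort(checkpointId);

        // "task thread": the scheduled checkpoint action runs after the abort.
        try {
            Runnable action;
            while ((action = mailbox.poll()) != null) {
                action.run();
            }
            return null;
        } catch (RuntimeException e) {
            return e;
        }
    }

    public static void main(String[] args) {
        System.out.println("checkpoint action failed with: " + replay());
    }
}
```

In this model the scheduled action always fails, because the abort runs between
scheduling and execution. In a real task the ordering depends on thread timing.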
> If we do not remove the ChannelStateWriteResult while handling #abort and
> rely on #stop to remove it, another problem can arise: the checkpoint might
> never be performed after #start (we have another mechanism to exit the
> triggered checkpoint early if the abort is sent by the
> CheckpointCoordinator), so the stale ChannelStateWriteResult would be
> retained inside ChannelStateWriter for a long time.
> One option to fix this issue is to let SubtaskCheckpointCoordinatorImpl
> handle the exception from ChannelStateWriter#getWriteResult properly, so
> that the task does not fail in the aborted case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
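
A minimal sketch of the option proposed in the issue description, in which the
caller tolerates the exception when the checkpoint is known to be aborted. All
names (ToyWriter, ToyCoordinator, onBarrier, onAbort, performCheckpoint, the
abortedIds set) are assumptions made for illustration. This is not the actual
SubtaskCheckpointCoordinatorImpl code or the patch attached to the issue, and
it ignores threading entirely.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of an abort-tolerant checkpoint path; hypothetical names, not Flink code. */
public class AbortTolerantCheckpointSketch {

    /** Stand-in for ChannelStateWriter. */
    public static class ToyWriter {
        private final Map<Long, Object> results = new HashMap<>();

        public void start(long checkpointId) {
            results.put(checkpointId, new Object());
        }

        public void abort(long checkpointId) {
            results.remove(checkpointId);
        }

        public Object getWriteResult(long checkpointId) {
            Object result = results.get(checkpointId);
            if (result == null) {
                throw new IllegalStateException(
                        "no write result for checkpoint " + checkpointId);
            }
            return result;
        }
    }

    /** Stand-in for the subtask-side coordinator that performs the checkpoint. */
    public static class ToyCoordinator {
        private final ToyWriter writer = new ToyWriter();
        // Assumption: the coordinator remembers which checkpoints it aborted.
        private final Set<Long> abortedIds = new HashSet<>();

        public void onBarrier(long checkpointId) {
            writer.start(checkpointId);
        }

        public void onAbort(long checkpointId) {
            abortedIds.add(checkpointId);
            writer.abort(checkpointId);
        }

        /** Returns true if the checkpoint ran, false if it was skipped as aborted. */
        public boolean performCheckpoint(long checkpointId) {
            try {
                writer.getWriteResult(checkpointId);
                return true;
            } catch (IllegalStateException e) {
                if (abortedIds.remove(checkpointId)) {
                    // Expected: the result was removed by #abort, so skip quietly.
                    return false;
                }
                // Not an aborted checkpoint: keep the validation failure.
                throw e;
            }
        }
    }

    public static void main(String[] args) {
        ToyCoordinator coordinator = new ToyCoordinator();
        coordinator.onBarrier(1L);
        coordinator.onAbort(1L);
        System.out.println("checkpoint 1 ran: " + coordinator.performCheckpoint(1L));
    }
}
```

The exception is still rethrown for checkpoints that were never aborted, so the
illegal state check keeps validating the normal process.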