[ https://issues.apache.org/jira/browse/FLINK-25486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthias Pohl updated FLINK-25486:
----------------------------------
    Priority: Blocker  (was: Critical)

> Perjob can not recover from checkpoint when zookeeper leader changes
> --------------------------------------------------------------------
>
>                 Key: FLINK-25486
>                 URL: https://issues.apache.org/jira/browse/FLINK-25486
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0, 1.13.5, 1.14.2
>            Reporter: Liu
>            Assignee: Liu
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.15.0, 1.13.6, 1.14.4
>
>
> When the config
> high-availability.zookeeper.client.tolerate-suspended-connections is left at
> its default of false, the appMaster fails over as soon as the ZooKeeper
> leader changes. In this case, the old appMaster cleans up all the ZooKeeper
> data, so the new appMaster cannot recover from the latest checkpoint.
> The process is as follows:
> # Start a perJob application.
> # Kill the ZooKeeper leader node, which causes the perJob cluster to suspend.
> # In MiniDispatcher's function jobReachedTerminalState, shutDownFuture is
> set to UNKNOWN.
> # The future is passed to ClusterEntrypoint, and the method is called with
> cleanupHaData set to true.
> # The ZooKeeper data is cleaned up and the process exits.
> # The new appMaster does not find any checkpoint to start from, and the
> state is lost.
> Since the job can recover automatically when the ZooKeeper leader changes,
> it is reasonable to keep the ZooKeeper data for the coming recovery.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
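
As a possible stopgap for the scenario described above, the option named in the description could be set explicitly so that a suspended ZooKeeper connection is tolerated instead of being treated as a loss of leadership. The fragment below is only a sketch of a flink-conf.yaml entry. It assumes a Flink version that ships this option, and it does not change the cleanup logic itself, so a connection that is actually lost may still lead to the behaviour reported here.

```yaml
# flink-conf.yaml (fragment, illustrative only)
# The description states that the default is false. Setting it to true is an
# assumption about a workaround, not the fix tracked by this issue.
high-availability.zookeeper.client.tolerate-suspended-connections: true
```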