[ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvid Heise updated FLINK-16931:
--------------------------------
    Affects Version/s: 1.12.0

> Large _metadata file lead to JobManager not responding when restart
> -------------------------------------------------------------------
>
>                 Key: FLINK-16931
>                 URL: https://issues.apache.org/jira/browse/FLINK-16931
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.9.2, 1.10.0, 1.11.0, 1.12.0
>            Reporter: Lu Niu
>            Assignee: Lu Niu
>            Priority: Critical
>             Fix For: 1.13.0
>
>
> When the _metadata file is large, the JobManager can never recover from the
> checkpoint. It falls into a loop of fetch checkpoint -> JM timeout -> restart.
> Here is the related log:
> {code:java}
> 2020-04-01 17:08:25,689 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
> 2020-04-01 17:08:25,698 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 3 checkpoints in ZooKeeper.
> 2020-04-01 17:08:25,698 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to fetch 3 checkpoints from storage.
> 2020-04-01 17:08:25,698 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 50.
> 2020-04-01 17:08:48,589 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 51.
> 2020-04-01 17:09:12,775 INFO org.apache.flink.yarn.YarnResourceManager - The heartbeat of JobManager with id 02500708baf0bb976891c391afd3d7d5 timed out.
> {code}
> Digging into the code, it looks like ExecutionGraph::restart runs in the
> JobMaster main thread and eventually calls
> ZooKeeperCompletedCheckpointStore::retrieveCompletedCheckpoint, which downloads
> the file from DFS. The main thread is blocked while this happens (roughly 23
> seconds for checkpoint 50 in the log above), and the JobManager heartbeat
> times out in the meantime. One possible solution is to make the downloading
> part async.
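
The async idea above can be sketched as follows. This is a minimal illustration under stated assumptions, not Flink's actual code or the eventual fix: the class name, the `mainThread` and `ioExecutor` executors, and the stand-in `retrieveCompletedCheckpoint` method are all hypothetical, and only `java.util.concurrent` is used. The blocking download runs on an I/O pool, and its result is handed back to the single "main thread" executor, which stays free to handle a heartbeat in the meantime.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class AsyncCheckpointRecoverySketch {

    // Hypothetical stand-in for the blocking download of a large _metadata file.
    // To keep the demo deterministic, the "download" does not finish until the
    // heartbeat has been handled (or 5 seconds have passed).
    static String retrieveCompletedCheckpoint(long checkpointId, CountDownLatch heartbeatHandled) {
        try {
            heartbeatHandled.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return "checkpoint-" + checkpointId;
    }

    static List<String> run() throws Exception {
        // Assumption: one thread models the JobMaster main thread, a pool models I/O.
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        ExecutorService ioExecutor = Executors.newFixedThreadPool(2);
        List<String> events = new CopyOnWriteArrayList<>();
        CountDownLatch heartbeatHandled = new CountDownLatch(1);
        try {
            // Blocking retrieval runs off the main thread; only the cheap
            // "apply the result" step is scheduled back onto it.
            CompletableFuture<Void> restored = CompletableFuture
                .supplyAsync(() -> retrieveCompletedCheckpoint(51L, heartbeatHandled), ioExecutor)
                .thenAcceptAsync(cp -> events.add("restored " + cp), mainThread);

            // The main thread is still free, so the heartbeat is handled promptly.
            CompletableFuture<Void> heartbeat = CompletableFuture.runAsync(() -> {
                events.add("heartbeat");
                heartbeatHandled.countDown();
            }, mainThread);

            heartbeat.get(10, TimeUnit.SECONDS);
            restored.get(10, TimeUnit.SECONDS);
            return new ArrayList<>(events);
        } finally {
            mainThread.shutdown();
            ioExecutor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());
    }
}
```

Handing the result back to the main-thread executor, rather than applying it on the I/O thread, keeps any state that is confined to the main thread single-threaded, which is the concern raised about the original change.
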
> More things might need to be considered, as the original change tries to make
> it single-threaded. [https://github.com/apache/flink/pull/7568]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)