[ 
https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17087849#comment-17087849
 ] 

Till Rohrmann commented on FLINK-16931:
---------------------------------------

Sorry for my late response. After looking at this issue I believe that the fix 
is far from trivial and requires a very thorough design.

There are actually two components which will be affected by this change.

1. Making the restore operation asynchronous in the {{CheckpointCoordinator}}
2. Enabling the scheduler to use an asynchronous state restore operation

The latter is strictly blocked on the {{CheckpointCoordinator}} work. 

I think that making the restore operation work asynchronously requires to 
handle other {{CheckpointCoordinator}} operations accordingly. For example, 
what happens to pending checkpoints which are concurrently completed while a 
restore operation happens? There is already a discussion about exactly this 
problem in FLINK-16770.

Once the {{CheckpointCoordinator}} can asynchronously retrieve the state to 
restore, it needs to be integrated into the new scheduler. Here the challenge 
is to handle concurrent scheduling operations properly. For example, while one 
waits for state to restore, a concurrent failover operation could be triggered. 
How is this handled and how is the potential scheduling conflict resolved?

It would be awesome if [~pnowojski] could take the lead on the 
{{CheckpointCoordinator}} changes since he was already involved in FLINK-13698. 
While working on the restore state method, it would actually be a good 
opportunity to change the {{CheckpointCoordinator}} so that it rather returns a 
set of state handles instead of directly working on {{Executions}}.

Once this is done, I will help to apply the scheduler changes. Fortunately, the 
scheduler already consists of multiple asynchronous stages which need to 
resolve conflicts originating from concurrent operations. Hence, I hope that 
another asynchronous stage of state restore might not be too difficult.

> Large _metadata file lead to JobManager not responding when restart
> -------------------------------------------------------------------
>
>                 Key: FLINK-16931
>                 URL: https://issues.apache.org/jira/browse/FLINK-16931
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.9.2, 1.10.0, 1.11.0
>            Reporter: Lu Niu
>            Assignee: Lu Niu
>            Priority: Critical
>             Fix For: 1.11.0
>
>
> When _metadata file is big, JobManager could never recover from checkpoint. 
> It fall into a loop that fetch checkpoint -> JM timeout -> restart. Here is 
> related log: 
> {code:java}
>  2020-04-01 17:08:25,689 INFO 
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - 
> Recovering checkpoints from ZooKeeper.
>  2020-04-01 17:08:25,698 INFO 
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 
> 3 checkpoints in ZooKeeper.
>  2020-04-01 17:08:25,698 INFO 
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - 
> Trying to fetch 3 checkpoints from storage.
>  2020-04-01 17:08:25,698 INFO 
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - 
> Trying to retrieve checkpoint 50.
>  2020-04-01 17:08:48,589 INFO 
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - 
> Trying to retrieve checkpoint 51.
>  2020-04-01 17:09:12,775 INFO org.apache.flink.yarn.YarnResourceManager - The 
> heartbeat of JobManager with id 02500708baf0bb976891c391afd3d7d5 timed out.
> {code}
> Digging into the code, looks like ExecutionGraph::restart runs in JobMaster 
> main thread and finally calls 
> ZooKeeperCompletedCheckpointStore::retrieveCompletedCheckpoint which download 
> file form DFS. The main thread is basically blocked for a while because of 
> this. One possible solution is to making the downloading part async. More 
> things might need to consider as the original change tries to make it 
> single-threaded. [https://github.com/apache/flink/pull/7568]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to