My team and I could use some help debugging the following issue, and maybe understanding Flink's full checkpoint recovery decision tree:
We've seen, a few times now, a scenario where a task restarts (but not the JobManager) shortly after a recent checkpoint is saved, yet upon coming back up Flink chooses a much older checkpoint. Below is a log of one such event. In it, checkpoint 64996 is written (the log indicates this, and checking S3 confirms it), but the job restarts from 61634. Looking at the log, I'm wondering:

1. Is it likely that Flink failed to update ZooKeeper, despite writing the checkpoint to S3? (To rule this out ourselves, we're planning to inspect the checkpoint znodes directly; see the sketch after these questions.)
2. In the event that Flink fails to find an entry in ZooKeeper, what is its fallback algorithm? That is, where does it look next for a recovery point?
3. It seems to have ultimately ended up at the checkpoint that existed at the time the job started. Is there a configuration that would allow the fallback checkpoint to be something more recent?
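For (1), this is roughly how we plan to dump the checkpoint znodes and compare their mtimes against the S3 writes. It's only a minimal sketch using Curator directly: the class name is ours, the connect string is a placeholder for our high-availability.zookeeper.quorum value, it ignores any auth/ACLs the ensemble may require, and the path is the one ZooKeeperStateHandleStore logs below.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.data.Stat;

public class CheckpointZnodeDump {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string; ours comes from high-availability.zookeeper.quorum.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-1:2181,zk-2:2181,zk-3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // The checkpoints path for this job, taken from the ZooKeeperStateHandleStore line in the log below.
        String path = "/flink/prd/desanalytics-7216/doc-comprehension-analytics-7216"
                + "/checkpoints/00000000000000000000000000000000";

        // Each child znode should be one registered completed checkpoint (its name encodes
        // the checkpoint ID, and its data is a serialized state handle pointing at the S3
        // metadata); ctime/mtime tell us when the handle was written to ZooKeeper.
        for (String child : client.getChildren().forPath(path)) {
            Stat stat = new Stat();
            client.getData().storingStatIn(stat).forPath(path + "/" + child);
            System.out.printf("%s  ctime=%d  mtime=%d%n", child, stat.getCtime(), stat.getMtime());
        }
        client.close();
    }
}

If 64996 never shows up there, that would point at the handle write to ZooKeeper failing (or being removed, as the "Removing ... from ZooKeeper" line below suggests) rather than the S3 upload itself.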
Thanks, Matt

2021/10/11 12:22:28.137 INFO c.i.strmprocess.ArgsPreprocessor - latestSavepointPrefix:desanalytics-7216-doc-comprehension-analytics-7216-prd/4/savepoints/savepoint-000000-abb450590ca7/_metadata LastModified:2021-10-11T12:03:47Z
2021/10/11 12:22:43.188 INFO o.a.f.r.c.CheckpointCoordinator - Starting job 00000000000000000000000000000000 from savepoint s3://spp-state-371299021277-tech-core-idx/desanalytics-7216-doc-comprehension-analytics-7216-prd/4/savepoints/savepoint-000000-abb450590ca7 (allowing non restored state)
Starting standalonejob as a console application on host doc-comprehension-analytics-7216-7dbfbc4bbd-rzvpw. args: --job-id 00000000000000000000000000000000 --allowNonRestoredState --job-classname com.intuit.ifdp.doccomprehension.analytics.DocComprehensionAnalyticsProcessor --fromSavepoint s3://spp-state-371299021277-tech-core-idx/desanalytics-7216-doc-comprehension-analytics-7216-prd/4/savepoints/savepoint-000000-abb450590ca7
2021/10/11 12:22:51.777 INFO o.a.f.r.c.CheckpointCoordinator - Reset the checkpoint ID of job 00000000000000000000000000000000 to 61634.
2021/10/11 12:22:51.777 INFO o.a.f.r.c.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
2021/10/11 12:22:51.788 INFO o.a.f.r.c.ZooKeeperCompletedCheckpointStore - Found 1 checkpoints in ZooKeeper.
2021/10/11 12:22:51.788 INFO o.a.f.r.c.ZooKeeperCompletedCheckpointStore - Trying to fetch 1 checkpoints from storage.
2021/10/11 12:22:51.788 INFO o.a.f.r.c.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 61633.
2021/10/11 12:22:51.895 INFO o.a.f.r.c.CheckpointCoordinator - Restoring job 00000000000000000000000000000000 from latest valid checkpoint: Checkpoint 61633 @ 0 for 00000000000000000000000000000000.

<<<<<<<< long run here, with a few errors & recover >>>>>>>> NO restarts on JM

2021/10/13 23:10:37.918 INFO o.a.f.r.c.CheckpointCoordinator - Triggering checkpoint 64996 @ 1634166637914 for job 00000000000000000000000000000000.
2021/10/13 23:10:49.933 INFO o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 64996 for job 00000000000000000000000000000000 (20015670 bytes in 11759 ms).
2021/10/13 23:10:59.200 INFO o.a.f.s.z.o.a.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x17c099f647f1e69, likely server has closed socket, closing socket connection and attempting reconnect
2021/10/13 23:10:59.301 INFO o.a.f.s.c.o.a.c.f.s.ConnectionStateManager - State change: SUSPENDED
2021/10/13 23:10:59.301 WARN o.a.f.r.l.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2021/10/13 23:10:59.323 INFO o.a.f.r.e.ExecutionGraph - Could not restart the job doc-comprehension-analytics-7216 (00000000000000000000000000000000) because the restart strategy prevented it.
2021/10/13 23:10:59.543 INFO o.a.f.r.z.ZooKeeperStateHandleStore - Removing /flink/prd/desanalytics-7216/doc-comprehension-analytics-7216/checkpoints/00000000000000000000000000000000 from ZooKeeper
2021/10/13 23:10:59.555 INFO o.a.f.r.e.ExecutionGraph - Job recovers via failover strategy: full graph restart
2021/10/13 23:10:59.622 INFO o.a.f.r.c.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
2021/10/13 23:10:59.643 INFO o.a.f.r.c.ZooKeeperCompletedCheckpointStore - Found 0 checkpoints in ZooKeeper.
2021/10/13 23:10:59.643 INFO o.a.f.r.c.ZooKeeperCompletedCheckpointStore - Trying to fetch 0 checkpoints from storage.
2021/10/13 23:10:59.643 INFO o.a.f.r.c.CheckpointCoordinator - Starting job 00000000000000000000000000000000 from savepoint s3://spp-state-371299021277-tech-core-idx/desanalytics-7216-doc-comprehension-analytics-7216-prd/4/savepoints/savepoint-000000-abb450590ca7 (allowing non restored state)
2021/10/13 23:10:59.744 INFO o.a.f.r.r.StandaloneResourceManager - Registering TaskManager with ResourceID 085da1dcfff1103b0f054dbd8a275357 (akka.ssl.tcp://flink@10.35.132.216:41531/user/taskmanager_0) at ResourceManager
2021/10/13 23:11:01.550 INFO o.a.f.r.r.StandaloneResourceManager - Ignoring outdated TaskExecutorGateway connection.
2021/10/13 23:11:01.550 INFO o.a.f.r.r.StandaloneResourceManager - Registering TaskManager with ResourceID 7606c3879f564a3d8936c6a1db86e611 (akka.ssl.tcp://flink@10.35.164.209:37357/user/taskmanager_0) at ResourceManager
2021/10/13 23:11:07.215 INFO o.a.f.r.c.CheckpointCoordinator - Reset the checkpoint ID of job 00000000000000000000000000000000 to 61634.
2021/10/13 23:11:07.215 INFO o.a.f.r.c.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
2021/10/13 23:11:07.227 INFO o.a.f.r.c.ZooKeeperCompletedCheckpointStore - Found 1 checkpoints in ZooKeeper.
2021/10/13 23:11:07.227 INFO o.a.f.r.c.ZooKeeperCompletedCheckpointStore - Trying to fetch 1 checkpoints from storage.
2021/10/13 23:11:07.227 INFO o.a.f.r.c.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 61633.
2021/10/13 23:11:07.317 INFO o.a.f.r.c.CheckpointCoordinator - Restoring job 00000000000000000000000000000000 from latest valid checkpoint: Checkpoint 61633 @ 0 for 00000000000000000000000000000000.
2021/10/13 23:11:07.331 INFO o.a.f.r.c.CheckpointCoordinator - No master state to restore
2021/10/13 23:11:38.522 INFO o.a.f.r.c.ZooKeeperCompletedCheckpointStore - Found 1 checkpoints in ZooKeeper.
2021/10/13 23:11:38.522 INFO o.a.f.r.c.ZooKeeperCompletedCheckpointStore - Trying to fetch 1 checkpoints from storage.
2021/10/13 23:11:38.522 INFO o.a.f.r.c.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 61634.
2021/10/13 23:11:38.531 INFO o.a.f.r.r.StandaloneResourceManager - Registering TaskManager with ResourceID 085da1dcfff1103b0f054dbd8a275357 (akka.ssl.tcp://flink@10.35.132.216:41531/user/taskmanager_0) at ResourceManager
2021/10/13 23:11:38.591 INFO o.a.f.r.c.CheckpointCoordinator - Restoring job 00000000000000000000000000000000 from latest valid checkpoint: Checkpoint 61634 @ 1634166669489 for 00000000000000000000000000000000.
2021/10/13 23:11:38.605 INFO o.a.f.r.c.CheckpointCoordinator - No master state to restore
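P.S. Regarding (3): the only related knob we've found so far is state.checkpoints.num-retained (default 1), which would explain why the store only ever reports a single checkpoint in ZooKeeper. We're not sure it would have helped in this case, since the whole checkpoints node appears to have been removed, but for reference this is the programmatic equivalent we'd experiment with (the value 5 is arbitrary):

import org.apache.flink.configuration.CheckpointingOptions;
import org.apache.flink.configuration.Configuration;

public class RetainedCheckpoints {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Same as "state.checkpoints.num-retained: 5" in flink-conf.yaml: keep the five
        // most recent completed checkpoints registered instead of only the latest one.
        conf.setInteger(CheckpointingOptions.MAX_RETAINED_CHECKPOINTS, 5);
        System.out.println(conf.getInteger(CheckpointingOptions.MAX_RETAINED_CHECKPOINTS));
    }
}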