[ https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Rohrmann closed FLINK-19816.
---------------------------------
    Resolution: Fixed

Fixed via

1.12.0: cb850fdda2b40866f3b0782e038ae4bce35c9eb0
1.11.3: a2925a0d2e894bf28aaced2993ec453589d143de

> Flink restored from a wrong checkpoint (a very old one and not the last completed one)
> ---------------------------------------------------------------------------------------
>
>                 Key: FLINK-19816
>                 URL: https://issues.apache.org/jira/browse/FLINK-19816
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.0, 1.12.0
>            Reporter: Steven Zhen Wu
>            Assignee: Till Rohrmann
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.12.0, 1.11.3
>
>
> h2. Summary
> Upon failure, Flink did not restore from the last completed checkpoint. Instead, it restored from a very old checkpoint. As a result, the restored Kafka offsets were invalid, and the job replayed from the beginning because the Kafka consumer's "auto.offset.reset" was set to "EARLIEST".
> This is an embarrassingly parallel, stateless job with a parallelism of over 1,000. The full jobmanager log file at INFO level is available upon request.
> h2. Sequence of events from the logs
> Just before the failure, checkpoint *210768* completed.
> {code}
> 2020-10-25 02:35:05,970 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator [jobmanager-future-thread-5] - Completed checkpoint 210768 for job 233b4938179c06974e4535ac8a868675 (4623776 bytes in 120402 ms).
> {code}
> During the restart, it somehow decided to restore from a very old checkpoint, *203531*.
> {code:java}
> 2020-10-25 02:36:03,301 INFO  org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [cluster-io-thread-3] - Start SessionDispatcherLeaderProcess.
> 2020-10-25 02:36:03,302 INFO  org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [cluster-io-thread-5] - Recover all persisted job graphs.
> 2020-10-25 02:36:03,304 INFO  com.netflix.bdp.s3fs.BdpS3FileSystem [cluster-io-thread-25] - Deleting path: s3://<bucket>/checkpoints/XM3B/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/c31aec1e-07a7-4193-aa00-3fbe83f9e2e6
> 2020-10-25 02:36:03,307 INFO  org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [cluster-io-thread-5] - Trying to recover job with job id 233b4938179c06974e4535ac8a868675.
> 2020-10-25 02:36:03,381 INFO  com.netflix.bdp.s3fs.BdpS3FileSystem [cluster-io-thread-25] - Deleting path: s3://<bucket>/checkpoints/Hh86/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/4ab92f70-dfcd-4212-9b7f-bdbecb9257fd
> ...
> 2020-10-25 02:36:03,427 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore [flink-akka.actor.default-dispatcher-82003] - Recovering checkpoints from ZooKeeper.
> 2020-10-25 02:36:03,432 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore [flink-akka.actor.default-dispatcher-82003] - Found 0 checkpoints in ZooKeeper.
> 2020-10-25 02:36:03,432 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore [flink-akka.actor.default-dispatcher-82003] - Trying to fetch 0 checkpoints from storage.
> 2020-10-25 02:36:03,432 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator [flink-akka.actor.default-dispatcher-82003] - Starting job 233b4938179c06974e4535ac8a868675 from savepoint s3://<bucket>/checkpoints/metadata/clapp_avro-clapp_avro_nontvui/1113/47e2a25a8d0b696c7d0d423722bb6f54/chk-203531/_metadata ()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
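For context on the "auto.offset.reset" behaviour mentioned in the summary, below is a minimal, hypothetical sketch of a Flink 1.11/1.12 Kafka source configured that way; it is not the reporter's actual job code, and the topic name, group id, bootstrap servers, and checkpoint interval are placeholders. Roughly, when the offsets restored from a (stale) checkpoint are no longer available on the broker, the underlying Kafka consumer falls back to the reset policy and re-reads the topic from its earliest available offset. (The report quotes "EARLIEST"; the standard Kafka property value is lowercase "earliest", used here.)

{code:java}
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaAutoOffsetResetSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(120_000); // placeholder checkpoint interval

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092"); // placeholder
        props.setProperty("group.id", "my-group");             // placeholder
        // If the offsets restored from Flink state are out of range on the broker
        // (e.g. the restored checkpoint is very old and the data has expired),
        // the Kafka consumer falls back to this policy and starts reading the
        // topic from its earliest available offset.
        props.setProperty("auto.offset.reset", "earliest");

        FlinkKafkaConsumer<String> source =
                new FlinkKafkaConsumer<>("my-topic", new SimpleStringSchema(), props); // placeholder topic

        // Only used on a fresh start; on a restore, the offsets stored in the
        // checkpoint/savepoint take precedence over this start-up position.
        source.setStartFromGroupOffsets();

        DataStream<String> lines = env.addSource(source);
        lines.print();

        env.execute("kafka-auto-offset-reset-sketch");
    }
}
{code}

Since the start-up position is ignored whenever state is restored, the consumer configuration itself is not the root cause here; the problem is that recovery picked a stale checkpoint whose offsets had already expired, which is what the fix above addresses.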