Thanks for your reply. Yes, this is indeed an option, but I was more after a config option to handle that scenario. If the HA metadata points to a checkpoint that is obviously not present (a 404 error in the S3 case) there is little value in retrying: the HA data are clearly worthless at that point.
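Failing that, the manual recovery the exception itself hints at ("Try cleaning the state handle store") would be to remove the stale checkpoint entry from ZooKeeper by hand before restarting the JobManager. This is a rough sketch only, assuming direct zkCli.sh (ZooKeeper 3.5+) access to the HA quorum; `<zk-quorum>` is a placeholder and the paths are copied from the log further down, not verified:

```
# Sketch: manually cleaning the ZooKeeper state handle store, per the hint in the exception.
# Assumes zkCli.sh (ZooKeeper 3.5+) can reach the HA quorum; <zk-quorum> is a placeholder.
# Stop the JobManager first so it does not race with the cleanup.

# Inspect which checkpoint entries the HA store still references for the job
bin/zkCli.sh -server <zk-quorum> ls /flink/aiops/ir-lifecycle/jobs/2512c6153c7ae16fa6da6d64772d75c5/checkpoints

# Remove the entry pointing at the checkpoint that no longer exists in S3
bin/zkCli.sh -server <zk-quorum> deleteall /flink/aiops/ir-lifecycle/jobs/2512c6153c7ae16fa6da6d64772d75c5/checkpoints/0000000000000050417
```

After that the job would still have to be resubmitted from whatever savepoint/checkpoint actually remains in S3 (or from scratch), since its HA metadata is gone.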
But maybe there isn't such an option.

Best regards

JM
________________________________
From: Zhanghao Chen <zhanghao.c...@outlook.com>
Sent: Tuesday, June 11, 2024 03:56
To: Jean-Marc Paulin <j...@uk.ibm.com>; user@flink.apache.org <user@flink.apache.org>
Subject: [EXTERNAL] Re: Failed to resume from HA when the checkpoint has been deleted.

Hi,

In this case, you could cancel the job using the flink stop command, which will clean up Flink HA metadata, and resubmit the job.

Best,
Zhanghao Chen
________________________________
From: Jean-Marc Paulin <j...@uk.ibm.com>
Sent: Monday, June 10, 2024 18:53
To: user@flink.apache.org <user@flink.apache.org>
Subject: Failed to resume from HA when the checkpoint has been deleted.

Hi,

We have a Flink 1.19 streaming job with HA enabled (ZooKeeper) and checkpoints/savepoints in S3. We had an outage and now the JobManager keeps restarting. We think this is because it reads the job id to restart from ZooKeeper, but since we lost our S3 storage as part of the outage it cannot find the checkpoint to restore from, and dies:

```
Found 1 checkpoints in ZooKeeperStateHandleStore{namespace='flink/aiops/ir-lifecycle/jobs/2512c6153c7ae16fa6da6d64772d75c5/checkpoints'
Trying to fetch 1 checkpoints from storage.
Trying to retrieve checkpoint 50417.
exception: JobMaster for job 2512c6153c7ae16fa6da6d64772d75c5 failed.
Caused by: org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster.
Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Failed to initialize high-availability completed checkpoint store
...
Caused by: org.apache.flink.util.FlinkException: Could not retrieve checkpoint 50417 from state handle under /0000000000000050417. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
Caused by: com.facebook.presto.hive.s3.PrestoS3FileSystem$UnrecoverableS3OperationException: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 17D7166A4D756355; S3 Extended Request ID: fe09be003d6379d952fad9de241c370b5f7ac43631c02fdfbc9dda9c4398d6df; Proxy: null), S3 Extended Request ID: fe09be003d6379d952fad9de241c370b5f7ac43631c02fdfbc9dda9c4398d6df (Path: s3://test/high-availability/flink-job/completedCheckpoint64d901465702)
Fatal error occurred in the cluster entrypoint.
```

Is there an option we can use to configure the job to ignore this error?

Kind regards

JM
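For reference, the stop-and-resubmit route suggested above would look roughly like the following. This is a sketch only: `<jobId>`, the bucket, the savepoint name and the job jar are placeholders, not values taken from this thread.

```
# Sketch of the stop-and-resubmit route: "flink stop" takes a final savepoint, stops
# the job and (per the suggestion above) cleans up its HA metadata; the job is then
# resubmitted from that savepoint. <jobId>, paths and the jar are placeholders.

# Stop the job with a final savepoint
bin/flink stop --savepointPath s3://<bucket>/savepoints <jobId>

# Resubmit the job, restoring from the savepoint path printed by the previous command
bin/flink run --fromSavepoint s3://<bucket>/savepoints/savepoint-xxxxxx-yyyyyyyyyyyy \
    --detached /path/to/job.jar
```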