Hi,

Checkpoints are only used for failover within a single job. Once a job is cancelled, the related checkpoint counter metadata (stored in HA) is removed, but the checkpoint data itself can be retained if you configured it that way.
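To make "retained" concrete: on a filesystem checkpoint directory the data ends up under <checkpoint-dir>/<job-id>/chk-<n>/_metadata, which is the same layout as the path in the exception quoted further down. Here is a minimal Python sketch, not anything Flink ships, that finds the newest retained checkpoint for a job ID. The function name latest_retained_checkpoint is made up for illustration, and the sketch assumes the checkpoint directory is on a locally mounted filesystem and that a chk-<n> directory without a _metadata file should be treated as incomplete.

```python
import re
from pathlib import Path
from typing import Optional

# Matches checkpoint directories such as "chk-1" or "chk-42".
_CHK_DIR = re.compile(r"^chk-(\d+)$")


def latest_retained_checkpoint(checkpoint_root: str, job_id: str) -> Optional[Path]:
    """Return the highest-numbered chk-<n> directory that has a _metadata file.

    Hypothetical helper: assumes <checkpoint_root>/<job_id>/chk-<n>/_metadata
    on a local filesystem. Returns None if no such checkpoint is found.
    """
    job_dir = Path(checkpoint_root) / job_id
    if not job_dir.is_dir():
        return None

    best_number = -1
    best_path: Optional[Path] = None
    for child in job_dir.iterdir():
        match = _CHK_DIR.match(child.name)
        # Skip non-checkpoint entries and checkpoints without metadata.
        if match is None or not (child / "_metadata").is_file():
            continue
        number = int(match.group(1))  # compare numerically, so chk-10 > chk-2
        if number > best_number:
            best_number, best_path = number, child
    return best_path
```

For example, latest_retained_checkpoint("/opt/flink/pm/checkpoint", "000000006e6b13320000000000000000") would return the chk-<n> directory to hand to Flink as the restore path (for instance via flink run -s <path>, as described in the Flink docs on resuming from a retained checkpoint), or None if nothing usable was retained.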
IIUC, a redeploy or update cancels the old job and then starts a new one. To Flink these are two separate jobs, even if the job ID is the same. In this case the checkpoints are retained but the HA data is removed, so a new job with the same job ID starts counting checkpoints from 1 again and throws an exception when it tries to write a chk-1/_metadata file that already exists.

If you do not need to restore from the old checkpoint when redeploying the job, you can disable retained checkpoints [1].
If you do need to restore from the old checkpoint, you need to know the latest checkpoint path and specify it in the start command [2].

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpoints/#retained-checkpoints
[2] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpoints/#resuming-from-a-retained-checkpoint

Best,
Weihua

On Thu, May 11, 2023 at 1:43 PM amenreet sodhi <amenso...@gmail.com> wrote:

> Hey Hang,
>
> I am deploying my Flink Job in HA application mode. Whenever I redeploy my
> job, or deploy an updated version of the job, it's using the same job_id. I
> haven't configured anywhere to use a fixed job id, I think it's doing it by
> default. Can you share where I can configure this? I tried it once before,
> but couldn't find anything.
>
> Thanks
> Regards
> Amenreet Singh Sodhi
>
> On Wed, May 10, 2023 at 8:36 AM Hang Ruan <ruanhang1...@gmail.com> wrote:
>
>> Hi, amenreet,
>>
>> As Hangxiang said, we should use a new checkpoint dir if the new job has
>> the same jobId as the old one. Or else you should not use a fixed jobId
>> and the checkpoint dir will not conflict.
>>
>> Best,
>> Hang
>>
>> On Wed, May 10, 2023 at 10:35 AM Hangxiang Yu <master...@gmail.com> wrote:
>>
>>> Hi,
>>> I guess you used a fixed JOB_ID, and configured the same checkpoint dir
>>> as before?
>>> And you may also start the job without before state?
>>> The new job cannot know anything about before checkpoints, that's why
>>> the new job will fail when it tries to generate a new checkpoint.
>>> I'd like to suggest you to use different JOB_ID for different jobs, or
>>> set a different checkpoint dir for a new job.
>>>
>>> On Tue, May 9, 2023 at 9:38 PM amenreet sodhi <amenso...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Is there any way to prevent restart of flink job, or override the
>>>> checkpoint metadata, if for some reason there exists a checkpoint by same
>>>> name. I get the following exception and my job restarts, have been trying
>>>> to find solution for a very long time but havent found anything useful yet,
>>>> other than manually cleaning.
>>>>
>>>> 2023-02-27 10:00:50,360 WARN org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger or complete checkpoint 1 for job 000000006e6b13320000000000000000. (0 consecutive failed attempts so far)
>>>> org.apache.flink.runtime.checkpoint.CheckpointException: Failure to finalize checkpoint.
>>>>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.finalizeCheckpoint(CheckpointCoordinator.java:1375) ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1265) ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1157) ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>     at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89) ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>     at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119) ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
>>>>     at java.lang.Thread.run(Thread.java:834) [?:?]
>>>> Caused by: java.io.IOException: Target file file:/opt/flink/pm/checkpoint/000000006e6b13320000000000000000/chk-1/_metadata already exists.
>>>>     at org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.getOutputStreamWrapper(FsCheckpointMetadataOutputStream.java:168) ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>     at org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.<init>(FsCheckpointMetadataOutputStream.java:64) ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>     at org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:109) ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>     at org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:332) ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.finalizeCheckpoint(CheckpointCoordinator.java:1361) ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>     ... 7 more
>>>>
>>>> 2023-02-27 10:00:50,374 WARN org.apache.flink.runtime.jobmaster.JobMaster [] - Error while processing AcknowledgeCheckpoint message
>>>> org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize the pending checkpoint 1. Failure reason: Failure to finalize checkpoint.
>>>>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.finalizeCheckpoint(CheckpointCoordinator.java:1381) ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1265) ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1157) ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>     at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89) ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>     at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119) ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
>>>>     at java.lang.Thread.run(Thread.java:834) [?:?]
>>>> Caused by: java.io.IOException: Target file file:/opt/flink/pm/checkpoint/000000006e6b13320000000000000000/chk-1/_metadata already exists.
>>>>     at org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.getOutputStreamWrapper(FsCheckpointMetadataOutputStream.java:168) ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>
>>>> Please let me know if anyone knows how to resolve this issue.
>>>>
>>>> Thanks and Regards
>>>>
>>>> Amenreet Singh Sodhi
>>>
>>> --
>>> Best,
>>> Hangxiang.