Hi, I guess you used a fixed JOB_ID and configured the same checkpoint dir as before? And perhaps you also started the job without restoring the previous state? In that case the new job knows nothing about the earlier checkpoints. It starts again at checkpoint 1, so it fails when it tries to write chk-1/_metadata, which the old job already created. I'd suggest using a different JOB_ID for each job, or setting a different checkpoint dir for a new job.
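To make the collision concrete, here is a minimal sketch in plain Java (no Flink dependency) that models the directory layout visible in the log, `<checkpoint dir>/<job id>/chk-<n>/_metadata`. The class `CheckpointPaths`, its methods, and the second job ID are hypothetical illustrations, not Flink APIs. The sketch assumes, as the "checkpoint 1" in the log suggests, that a job started without earlier state numbers its checkpoints from 1.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;

// Hypothetical helper (NOT a Flink API) modelling the layout seen in the log:
//   <checkpoint dir>/<job id>/chk-<checkpoint id>/_metadata
final class CheckpointPaths {

    private CheckpointPaths() {}

    // Where the metadata file of a given checkpoint would end up.
    public static Path metadataPath(String checkpointDir, String jobId, long checkpointId) {
        return Paths.get(checkpointDir, jobId, "chk-" + checkpointId, "_metadata");
    }

    // Hypothetical helper: a separate base directory for each fresh (non-restored) run.
    public static String perRunCheckpointDir(String baseDir, String runTag) {
        return Paths.get(baseDir, runTag).toString();
    }

    public static void main(String[] args) {
        String dir = "/opt/flink/pm/checkpoint";
        String fixedJobId = "000000006e6b13320000000000000000";

        // Old run and new run: same dir, same fixed job ID, both at checkpoint 1.
        Path oldRun = metadataPath(dir, fixedJobId, 1);
        Path newRun = metadataPath(dir, fixedJobId, 1);
        // prints: same dir, fixed job ID -> collision: true
        System.out.println("same dir, fixed job ID -> collision: " + oldRun.equals(newRun));

        // Suggestion 1: a different job ID (made-up value) for the new job.
        Path otherJobId = metadataPath(dir, "00000000000000000000000000000001", 1);
        System.out.println("new job ID -> collision: " + oldRun.equals(otherJobId));

        // Suggestion 2: a different checkpoint dir for the new job.
        String runDir = perRunCheckpointDir(dir, UUID.randomUUID().toString());
        Path otherDir = metadataPath(runDir, fixedJobId, 1);
        System.out.println("new checkpoint dir -> collision: " + oldRun.equals(otherDir));
    }
}
```

In a real deployment the second suggestion corresponds to pointing the checkpoint directory (the `state.checkpoints.dir` option) at a fresh location for each job that starts without restored state.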
On Tue, May 9, 2023 at 9:38 PM amenreet sodhi <amenso...@gmail.com> wrote:
> Hi all,
>
> Is there any way to prevent a restart of the Flink job, or to override the
> checkpoint metadata, if for some reason a checkpoint with the same name
> already exists? I get the following exception and my job restarts. I have
> been trying to find a solution for a very long time but haven't found
> anything useful yet, other than cleaning up manually.
>
> 2023-02-27 10:00:50,360 WARN org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger or complete checkpoint 1 for job 000000006e6b13320000000000000000. (0 consecutive failed attempts so far)
> org.apache.flink.runtime.checkpoint.CheckpointException: Failure to finalize checkpoint.
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.finalizeCheckpoint(CheckpointCoordinator.java:1375) ~[event_executor-1.0-SNAPSHOT.jar:?]
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1265) ~[event_executor-1.0-SNAPSHOT.jar:?]
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1157) ~[event_executor-1.0-SNAPSHOT.jar:?]
>     at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89) ~[event_executor-1.0-SNAPSHOT.jar:?]
>     at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119) ~[event_executor-1.0-SNAPSHOT.jar:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
>     at java.lang.Thread.run(Thread.java:834) [?:?]
> Caused by: java.io.IOException: Target file file:/opt/flink/pm/checkpoint/000000006e6b13320000000000000000/chk-1/_metadata already exists.
>     at org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.getOutputStreamWrapper(FsCheckpointMetadataOutputStream.java:168) ~[event_executor-1.0-SNAPSHOT.jar:?]
>     at org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.<init>(FsCheckpointMetadataOutputStream.java:64) ~[event_executor-1.0-SNAPSHOT.jar:?]
>     at org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:109) ~[event_executor-1.0-SNAPSHOT.jar:?]
>     at org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:332) ~[event_executor-1.0-SNAPSHOT.jar:?]
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.finalizeCheckpoint(CheckpointCoordinator.java:1361) ~[event_executor-1.0-SNAPSHOT.jar:?]
>     ... 7 more
>
> 2023-02-27 10:00:50,374 WARN org.apache.flink.runtime.jobmaster.JobMaster [] - Error while processing AcknowledgeCheckpoint message
> org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize the pending checkpoint 1. Failure reason: Failure to finalize checkpoint.
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.finalizeCheckpoint(CheckpointCoordinator.java:1381) ~[event_executor-1.0-SNAPSHOT.jar:?]
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1265) ~[event_executor-1.0-SNAPSHOT.jar:?]
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1157) ~[event_executor-1.0-SNAPSHOT.jar:?]
>     at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89) ~[event_executor-1.0-SNAPSHOT.jar:?]
>     at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119) ~[event_executor-1.0-SNAPSHOT.jar:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
>     at java.lang.Thread.run(Thread.java:834) [?:?]
> Caused by: java.io.IOException: Target file file:/opt/flink/pm/checkpoint/000000006e6b13320000000000000000/chk-1/_metadata already exists.
>     at org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.getOutputStreamWrapper(FsCheckpointMetadataOutputStream.java:168) ~[event_executor-1.0-SNAPSHOT.jar:?]
>
> Please let me know if anyone knows how to resolve this issue.
>
> Thanks and regards,
> Amenreet Singh Sodhi

--
Best,
Hangxiang