Re: Flink Job Restarts if the metadata already exists for some Checkpoint

Weihua Hu Thu, 11 May 2023 00:50:21 -0700

Hi,

Checkpoints are only used for failover within a single job. Once a job is
cancelled, the related checkpoint-count metadata (stored in HA) is removed,
but the checkpoint data itself can be retained if you have configured that.


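IIUC, redeploying or updating the job cancels the old job and then starts a
new one. To Flink these are two separate jobs, even if the job ID is the same.
In this case, the checkpoints are retained but the HA data is removed, so the
new job with the same job ID starts counting checkpoints from the beginning
again and throws an exception when it tries to write a checkpoint that already
exists on disk.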
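
To illustrate the conflict, here is a minimal sketch in plain Java. It does not
use Flink's classes; the class and method names are hypothetical, and it only
assumes the <checkpoint-dir>/<job-id>/chk-<n>/_metadata layout seen in the stack
trace below, plus a writer that refuses to overwrite an existing file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for the checkpoint metadata writer, not Flink code.
public class MetadataConflictDemo {

    // Writes <checkpointDir>/<jobId>/chk-<checkpointId>/_metadata without
    // overwriting. Returns true if the file was created, false if it was
    // already present.
    public static boolean writeMetadata(Path checkpointDir, String jobId, long checkpointId)
            throws IOException {
        Path chkDir = checkpointDir.resolve(jobId).resolve("chk-" + checkpointId);
        Files.createDirectories(chkDir);
        Path metadata = chkDir.resolve("_metadata");
        try {
            // CREATE_NEW fails if the target file already exists.
            Files.write(metadata, "demo".getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("checkpoint-demo");
        String jobId = "000000006e6b13320000000000000000";

        // Old job: the first checkpoint is written as chk-1 and retained.
        System.out.println("old job, chk-1 written: " + writeMetadata(dir, jobId, 1));

        // New job with the same job ID: the HA counter is gone, so it starts
        // at 1 again and collides with the retained chk-1.
        System.out.println("new job, chk-1 written: " + writeMetadata(dir, jobId, 1));
    }
}
```

The second call returns false, which corresponds to the "Target file ...
already exists" error in the log. A different job ID or a different checkpoint
dir changes the path, so the collision goes away.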

If you do not need to restore from the old checkpoint when redeploying the
job, you can disable retained checkpoints [1].
If you do need to restore from the old checkpoint, you need to find the latest
checkpoint path and specify it in the start command [2].

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpoints/#retained-checkpoints
[2]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpoints/#resuming-from-a-retained-checkpoint
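
For the second option, the latest checkpoint path has to be looked up first.
Below is a rough sketch of one way to do that on a local filesystem, assuming
the chk-<n>/_metadata layout from the stack trace. The class and method names
are hypothetical and nothing here is a Flink API; for other filesystems (HDFS,
S3, ...) the listing logic would need to be adapted.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.Stream;

// Hypothetical helper, not part of Flink.
public class LatestCheckpointFinder {

    // Parses a directory name of the form "chk-<n>" into n.
    // Returns -1 for any other name.
    public static long checkpointId(Path dir) {
        String name = dir.getFileName().toString();
        if (!name.startsWith("chk-")) {
            return -1;
        }
        try {
            return Long.parseLong(name.substring("chk-".length()));
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    // Returns the chk-<n> directory with the highest n that contains a
    // _metadata file, i.e. the newest checkpoint that was finalized.
    public static Optional<Path> findLatest(Path jobCheckpointDir) throws IOException {
        if (!Files.isDirectory(jobCheckpointDir)) {
            return Optional.empty();
        }
        try (Stream<Path> children = Files.list(jobCheckpointDir)) {
            return children
                    .filter(p -> checkpointId(p) >= 0)
                    .filter(p -> Files.isRegularFile(p.resolve("_metadata")))
                    .max(Comparator.comparingLong(LatestCheckpointFinder::checkpointId));
        }
    }

    public static void main(String[] args) throws IOException {
        // Expects <checkpoint-dir>/<job-id> as the only argument.
        Path jobDir = Paths.get(args[0]);
        Optional<Path> latest = findLatest(jobDir);
        if (latest.isPresent()) {
            System.out.println(latest.get().toAbsolutePath());
        } else {
            System.out.println("no completed checkpoint found");
        }
    }
}
```

The printed path is what you would pass with -s to flink run, as described
in [2].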


Best,
Weihua


On Thu, May 11, 2023 at 1:43 PM amenreet sodhi <amenso...@gmail.com> wrote:

> Hey Hang,
>
> I am deploying my Flink job in HA application mode. Whenever I redeploy my
> job, or deploy an updated version of the job, it uses the same job ID. I
> haven't configured a fixed job ID anywhere; I think it's doing it by
> default. Can you share where I can configure this? I tried once before,
> but couldn't find anything.
>
> Thanks
> Regards
> Amenreet Singh Sodhi
>
> On Wed, May 10, 2023 at 8:36 AM Hang Ruan <ruanhang1...@gmail.com> wrote:
>
>> Hi, amenreet,
>>
>> As Hangxiang said, you should use a new checkpoint dir if the new job has
>> the same jobId as the old one. Otherwise, do not use a fixed jobId, so
>> that the checkpoint dirs will not conflict.
>>
>> Best,
>> Hang
>>
>> Hangxiang Yu <master...@gmail.com> wrote on Wed, May 10, 2023 at 10:35:
>>
>>> Hi,
>>> I guess you used a fixed JOB_ID and configured the same checkpoint dir
>>> as before?
>>> And you may also have started the job without the previous state?
>>> The new job knows nothing about the previous checkpoints; that's why
>>> the new job fails when it tries to generate a new checkpoint.
>>> I'd suggest using a different JOB_ID for different jobs, or
>>> setting a different checkpoint dir for a new job.
>>>
>>> On Tue, May 9, 2023 at 9:38 PM amenreet sodhi <amenso...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Is there any way to prevent a Flink job from restarting, or to override
>>>> the checkpoint metadata, if for some reason a checkpoint with the same
>>>> name already exists? I get the following exception and my job restarts.
>>>> I have been trying to find a solution for a very long time but haven't
>>>> found anything useful yet, other than manual cleanup.
>>>>
>>>> 2023-02-27 10:00:50,360 WARN  
>>>> org.apache.flink.runtime.checkpoint.CheckpointFailureManager
>>>> [] - Failed to trigger or complete checkpoint 1 for job
>>>> 000000006e6b13320000000000000000. (0 consecutive failed attempts so far)
>>>>
>>>> org.apache.flink.runtime.checkpoint.CheckpointException: Failure to
>>>> finalize checkpoint.
>>>>
>>>> at
>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.finalizeCheckpoint(CheckpointCoordinator.java:1375)
>>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>
>>>> at
>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1265)
>>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>
>>>> at
>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1157)
>>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>
>>>> at
>>>> org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)
>>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>
>>>> at
>>>> org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
>>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>
>>>> at
>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>>> [?:?]
>>>>
>>>> at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>>> [?:?]
>>>>
>>>> at java.lang.Thread.run(Thread.java:834) [?:?]
>>>>
>>>> Caused by: java.io.IOException: Target file
>>>> file:/opt/flink/pm/checkpoint/000000006e6b13320000000000000000/chk-1/_metadata
>>>> already exists.
>>>>
>>>> at
>>>> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.getOutputStreamWrapper(FsCheckpointMetadataOutputStream.java:168)
>>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>
>>>> at
>>>> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.<init>(FsCheckpointMetadataOutputStream.java:64)
>>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>
>>>> at
>>>> org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:109)
>>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>
>>>> at
>>>> org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:332)
>>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>
>>>> at
>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.finalizeCheckpoint(CheckpointCoordinator.java:1361)
>>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>
>>>> ... 7 more
>>>>
>>>> 2023-02-27 10:00:50,374 WARN  org.apache.flink.runtime.jobmaster.JobMaster
>>>>                 [] - Error while processing AcknowledgeCheckpoint
>>>> message
>>>>
>>>> org.apache.flink.runtime.checkpoint.CheckpointException: Could not
>>>> finalize the pending checkpoint 1. Failure reason: Failure to finalize
>>>> checkpoint.
>>>>
>>>> at
>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.finalizeCheckpoint(CheckpointCoordinator.java:1381)
>>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>
>>>> at
>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1265)
>>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>
>>>> at
>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1157)
>>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>
>>>> at
>>>> org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)
>>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>
>>>> at
>>>> org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
>>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>
>>>> at
>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>>> [?:?]
>>>>
>>>> at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>>> [?:?]
>>>>
>>>> at java.lang.Thread.run(Thread.java:834) [?:?]
>>>>
>>>> Caused by: java.io.IOException: Target file
>>>> file:/opt/flink/pm/checkpoint/000000006e6b13320000000000000000/chk-1/_metadata
>>>> already exists.
>>>>
>>>> at
>>>> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.getOutputStreamWrapper(FsCheckpointMetadataOutputStream.java:168)
>>>> ~[event_executor-1.0-SNAPSHOT.jar:?]
>>>>
>>>>
>>>> Please let me know if anyone knows how to resolve this issue.
>>>>
>>>> Thanks and Regards
>>>>
>>>> Amenreet Singh Sodhi
>>>>
>>>>
>>>>
>>>
>>> --
>>> Best,
>>> Hangxiang.
>>>
>>
