Nothing changed (as far as I know).  It's the same binary and the same
args.  It's Flink v1.12.3.  I'm going to switch away from auto-gen uids and
see if that helps.  The job randomly started failing to checkpoint.  I
cancelled the job and started it from the last successful checkpoint.

I'm confused why `811d3b279c8b26ed99ff0883b7630242` is used and not the
auto-generated uid.  That seems like a bug.

On Tue, Dec 7, 2021 at 10:40 PM Robert Metzger <metrob...@gmail.com> wrote:

> Hi Dan,
>
> When restoring a savepoint/checkpoint, Flink is matching the state for the
> operators based on the uuid of the operator. The exception says that there
> is some state that doesn't match any operator. So from Flink's perspective,
> the operator is gone.
> Here is more information:
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/savepoints/#assigning-operator-ids
>
>
> Somehow something must have changed in your job: Did you change the Flink
> version?
>
> Hope this helps!
>
> On Wed, Dec 8, 2021 at 5:49 AM Dan Hill <quietgol...@gmail.com> wrote:
>
>> I'm restoring the job with the same binary and same flags/args.
>>
>> On Tue, Dec 7, 2021 at 8:48 PM Dan Hill <quietgol...@gmail.com> wrote:
>>
>>> Hi.  I noticed this warning has "operator
>>> 811d3b279c8b26ed99ff0883b7630242" in it.  I assume this should be an
>>> operator uid or name.  It looks like something else.  What is it?  Is
>>> something corrupted?
>>>
>>>
>>> org.apache.flink.runtime.client.JobInitializationException: Could not 
>>> instantiate JobManager.
>>>     at 
>>> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:494)
>>>     at 
>>> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>>>     at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>     at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>     at java.lang.Thread.run(Thread.java:748)
>>> Caused by: java.lang.IllegalStateException: Failed to rollback to 
>>> checkpoint/savepoint 
>>> s3a://my-flink-state/checkpoints/ce9b90eafde97ca4629c13936c34426f/chk-626. 
>>> Cannot map checkpoint/savepoint state for operator 
>>> 811d3b279c8b26ed99ff0883b7630242 to the new program, because the operator 
>>> is not available in the new program. If you want to allow to skip this, you 
>>> can set the --allowNonRestoredState option on the CLI.
>>>     at 
>>> org.apache.flink.runtime.checkpoint.Checkpoints.throwNonRestoredStateException(Checkpoints.java:226)
>>>     at 
>>> org.apache.flink.runtime.checkpoint.Checkpoints.loadAndValidateCheckpoint(Checkpoints.java:190)
>>>     at 
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1627)
>>>     at 
>>> org.apache.flink.runtime.scheduler.SchedulerBase.tryRestoreExecutionGraphFromSavepoint(SchedulerBase.java:362)
>>>     at 
>>> org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:292)
>>>     at 
>>> org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:249)
>>>     at 
>>> org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:133)
>>>     at 
>>> org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:111)
>>>     at 
>>> org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:345)
>>>     at 
>>> org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:330)
>>>     at 
>>> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:95)
>>>     at 
>>> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:39)
>>>     at 
>>> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:162)
>>>     at 
>>> org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:86)
>>>     at 
>>> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:478)
>>>     ... 4 more
>>>
>>>

Reply via email to