Nothing changed (as far as I know). It's the same binary and the same args. It's Flink v1.12.3. I'm going to switch away from auto-gen uids and see if that helps. The job randomly started failing to checkpoint. I cancelled the job and started it from the last successful checkpoint.
I'm confused why `811d3b279c8b26ed99ff0883b7630242` is used and not the auto-generated uid. That seems like a bug. On Tue, Dec 7, 2021 at 10:40 PM Robert Metzger <metrob...@gmail.com> wrote: > Hi Dan, > > When restoring a savepoint/checkpoint, Flink is matching the state for the > operators based on the uuid of the operator. The exception says that there > is some state that doesn't match any operator. So from Flink's perspective, > the operator is gone. > Here is more information: > https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/savepoints/#assigning-operator-ids > > > Somehow something must have changed in your job: Did you change the Flink > version? > > Hope this helps! > > On Wed, Dec 8, 2021 at 5:49 AM Dan Hill <quietgol...@gmail.com> wrote: > >> I'm restoring the job with the same binary and same flags/args. >> >> On Tue, Dec 7, 2021 at 8:48 PM Dan Hill <quietgol...@gmail.com> wrote: >> >>> Hi. I noticed this warning has "operator >>> 811d3b279c8b26ed99ff0883b7630242" in it. I assume this should be an >>> operator uid or name. It looks like something else. What is it? Is >>> something corrupted? >>> >>> >>> org.apache.flink.runtime.client.JobInitializationException: Could not >>> instantiate JobManager. >>> at >>> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:494) >>> at >>> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) >>> at >>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) >>> at >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) >>> at java.lang.Thread.run(Thread.java:748) >>> Caused by: java.lang.IllegalStateException: Failed to rollback to >>> checkpoint/savepoint >>> s3a://my-flink-state/checkpoints/ce9b90eafde97ca4629c13936c34426f/chk-626. >>> Cannot map checkpoint/savepoint state for operator >>> 811d3b279c8b26ed99ff0883b7630242 to the new program, because the operator >>> is not available in the new program. If you want to allow to skip this, you >>> can set the --allowNonRestoredState option on the CLI. >>> at >>> org.apache.flink.runtime.checkpoint.Checkpoints.throwNonRestoredStateException(Checkpoints.java:226) >>> at >>> org.apache.flink.runtime.checkpoint.Checkpoints.loadAndValidateCheckpoint(Checkpoints.java:190) >>> at >>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1627) >>> at >>> org.apache.flink.runtime.scheduler.SchedulerBase.tryRestoreExecutionGraphFromSavepoint(SchedulerBase.java:362) >>> at >>> org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:292) >>> at >>> org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:249) >>> at >>> org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:133) >>> at >>> org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:111) >>> at >>> org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:345) >>> at >>> org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:330) >>> at >>> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:95) >>> at >>> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:39) >>> at >>> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:162) >>> at >>> org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:86) >>> at >>> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:478) >>> ... 4 more >>> >>>