Hi Dan, When restoring a savepoint/checkpoint, Flink is matching the state for the operators based on the uuid of the operator. The exception says that there is some state that doesn't match any operator. So from Flink's perspective, the operator is gone. Here is more information: https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/savepoints/#assigning-operator-ids
Somehow something must have changed in your job: Did you change the Flink version? Hope this helps! On Wed, Dec 8, 2021 at 5:49 AM Dan Hill <quietgol...@gmail.com> wrote: > I'm restoring the job with the same binary and same flags/args. > > On Tue, Dec 7, 2021 at 8:48 PM Dan Hill <quietgol...@gmail.com> wrote: > >> Hi. I noticed this warning has "operator >> 811d3b279c8b26ed99ff0883b7630242" in it. I assume this should be an >> operator uid or name. It looks like something else. What is it? Is >> something corrupted? >> >> >> org.apache.flink.runtime.client.JobInitializationException: Could not >> instantiate JobManager. >> at >> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:494) >> at >> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) >> at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) >> at java.lang.Thread.run(Thread.java:748) >> Caused by: java.lang.IllegalStateException: Failed to rollback to >> checkpoint/savepoint >> s3a://my-flink-state/checkpoints/ce9b90eafde97ca4629c13936c34426f/chk-626. >> Cannot map checkpoint/savepoint state for operator >> 811d3b279c8b26ed99ff0883b7630242 to the new program, because the operator is >> not available in the new program. If you want to allow to skip this, you can >> set the --allowNonRestoredState option on the CLI. >> at >> org.apache.flink.runtime.checkpoint.Checkpoints.throwNonRestoredStateException(Checkpoints.java:226) >> at >> org.apache.flink.runtime.checkpoint.Checkpoints.loadAndValidateCheckpoint(Checkpoints.java:190) >> at >> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1627) >> at >> org.apache.flink.runtime.scheduler.SchedulerBase.tryRestoreExecutionGraphFromSavepoint(SchedulerBase.java:362) >> at >> org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:292) >> at >> org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:249) >> at >> org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:133) >> at >> org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:111) >> at >> org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:345) >> at >> org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:330) >> at >> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:95) >> at >> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:39) >> at >> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:162) >> at >> org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:86) >> at >> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:478) >> ... 4 more >> >>