Thanks! On Tue, Dec 7, 2021, 22:55 Robert Metzger <metrob...@gmail.com> wrote:
> 811d3b279c8b26ed99ff0883b7630242 is the operator id. > If I'm not mistaken, running the job graph generation (e.g. the main > method) in DEBUG log level will show you all the IDs generated. This should > help you map this ID to your code. > > On Wed, Dec 8, 2021 at 7:52 AM Dan Hill <quietgol...@gmail.com> wrote: > >> Nothing changed (as far as I know). It's the same binary and the same >> args. It's Flink v1.12.3. I'm going to switch away from auto-gen uids and >> see if that helps. The job randomly started failing to checkpoint. I >> cancelled the job and started it from the last successful checkpoint. >> >> I'm confused why `811d3b279c8b26ed99ff0883b7630242` is used and not the >> auto-generated uid. That seems like a bug. >> >> On Tue, Dec 7, 2021 at 10:40 PM Robert Metzger <metrob...@gmail.com> >> wrote: >> >>> Hi Dan, >>> >>> When restoring a savepoint/checkpoint, Flink is matching the state for >>> the operators based on the uuid of the operator. The exception says that >>> there is some state that doesn't match any operator. So from Flink's >>> perspective, the operator is gone. >>> Here is more information: >>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/savepoints/#assigning-operator-ids >>> >>> >>> Somehow something must have changed in your job: Did you change the >>> Flink version? >>> >>> Hope this helps! >>> >>> On Wed, Dec 8, 2021 at 5:49 AM Dan Hill <quietgol...@gmail.com> wrote: >>> >>>> I'm restoring the job with the same binary and same flags/args. >>>> >>>> On Tue, Dec 7, 2021 at 8:48 PM Dan Hill <quietgol...@gmail.com> wrote: >>>> >>>>> Hi. I noticed this warning has "operator >>>>> 811d3b279c8b26ed99ff0883b7630242" in it. I assume this should be an >>>>> operator uid or name. It looks like something else. What is it? Is >>>>> something corrupted? >>>>> >>>>> >>>>> org.apache.flink.runtime.client.JobInitializationException: Could not >>>>> instantiate JobManager. >>>>> at >>>>> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:494) >>>>> at >>>>> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) >>>>> at >>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) >>>>> at >>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) >>>>> at java.lang.Thread.run(Thread.java:748) >>>>> Caused by: java.lang.IllegalStateException: Failed to rollback to >>>>> checkpoint/savepoint >>>>> s3a://my-flink-state/checkpoints/ce9b90eafde97ca4629c13936c34426f/chk-626. >>>>> Cannot map checkpoint/savepoint state for operator >>>>> 811d3b279c8b26ed99ff0883b7630242 to the new program, because the operator >>>>> is not available in the new program. If you want to allow to skip this, >>>>> you can set the --allowNonRestoredState option on the CLI. >>>>> at >>>>> org.apache.flink.runtime.checkpoint.Checkpoints.throwNonRestoredStateException(Checkpoints.java:226) >>>>> at >>>>> org.apache.flink.runtime.checkpoint.Checkpoints.loadAndValidateCheckpoint(Checkpoints.java:190) >>>>> at >>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1627) >>>>> at >>>>> org.apache.flink.runtime.scheduler.SchedulerBase.tryRestoreExecutionGraphFromSavepoint(SchedulerBase.java:362) >>>>> at >>>>> org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:292) >>>>> at >>>>> org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:249) >>>>> at >>>>> org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:133) >>>>> at >>>>> org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:111) >>>>> at >>>>> org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:345) >>>>> at >>>>> org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:330) >>>>> at >>>>> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:95) >>>>> at >>>>> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:39) >>>>> at >>>>> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:162) >>>>> at >>>>> org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:86) >>>>> at >>>>> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:478) >>>>> ... 4 more >>>>> >>>>>