Hi,

We have NullPointerException when trying to restore from savepoint for
the same jar, or different jar, or different parallelism.  We have
workaround this issue by changing UIDs in almost all operators. We want
to know if there is anyway to avoid this problem so that it will not
cause service maintence problem, thanks.


The following is redacted stack trace we can provide for now:

    2021-06-09 13:08:59,849 WARN  
org.apache.flink.client.deployment.application.DetachedApplicationRunner [] - 
Could not execute application:
    org.apache.flink.client.program.ProgramInvocationException: The main method 
caused an error: Failed to execute job '<censored>'.
            at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114) 
~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:84)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.client.deployment.application.DetachedApplicationRunner.run(DetachedApplicationRunner.java:70)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$0(JarRunHandler.java:102)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
 [?:?]
            at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
            at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
            at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
 [?:?]
            at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
[?:?]
            at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
[?:?]
            at java.lang.Thread.run(Thread.java:834) [?:?]
    Caused by: org.apache.flink.util.FlinkException: Failed to execute job 
'<censored>'.
            at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1970)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:135)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:76)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1834)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:801)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at <censored> ~[?:?]
            at <censored> ~[?:?]
            at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method) ~[?:?]
            at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 ~[?:?]
            at 
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[?:?]
            at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
            at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:355)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            ... 12 more
    Caused by: java.lang.RuntimeException: 
org.apache.flink.runtime.client.JobInitializationException: Could not start the 
JobMaster.
            at 
org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:316) 
~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:642)
 ~[?:?]
            at 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
 ~[?:?]
            ... 1 more
    Caused by: org.apache.flink.runtime.client.JobInitializationException: 
Could not start the JobMaster.
            at 
org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:97)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
 ~[?:?]
            at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
 ~[?:?]
            at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) 
~[?:?]
            at 
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1705)
 ~[?:?]
            ... 6 more
    Caused by: java.util.concurrent.CompletionException: 
java.lang.NullPointerException
            at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
 ~[?:?]
            at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
 ~[?:?]
            at 
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702)
 ~[?:?]
            ... 6 more
    Caused by: java.lang.NullPointerException
            at 
org.apache.flink.runtime.checkpoint.StateAssignmentOperation.reAssignSubKeyedStates(StateAssignmentOperation.java:300)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.runtime.checkpoint.StateAssignmentOperation.lambda$reDistributeKeyedStates$0(StateAssignmentOperation.java:260)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at java.util.HashMap.forEach(HashMap.java:1336) ~[?:?]
            at 
org.apache.flink.runtime.checkpoint.StateAssignmentOperation.reDistributeKeyedStates(StateAssignmentOperation.java:252)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.runtime.checkpoint.StateAssignmentOperation.assignAttemptState(StateAssignmentOperation.java:196)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.runtime.checkpoint.StateAssignmentOperation.assignStates(StateAssignmentOperation.java:139)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreLatestCheckpointedStateInternal(CheckpointCoordinator.java:1562)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1642)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.tryRestoreExecutionGraphFromSavepoint(DefaultExecutionGraphFactory.java:163)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:138)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:342)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:190) 
~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:120)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:132)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:110)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:340)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:317) 
~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:107)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:95)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            at 
org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
            ... 7 more
    2021-06-09 13:08:59,852 ERROR 
org.apache.flink.runtime.webmonitor.handlers.JarRunHandler   [] - Exception 
occurred in REST handler: Could not execute application.


-- 
ChangZhuo Chen (陳昌倬) czchen@{czchen,debian}.org
http://czchen.info/
Key fingerprint = BA04 346D C2E1 FE63 C790  8793 CC65 B0CD EC27 5D5B

Attachment: signature.asc
Description: PGP signature

Reply via email to