Hi Averell,

This problem is caused by a heartbeat timeout between the JobManager (JM) and the TaskManager (TM). You can narrow it down as follows:

1) Check the network status of the node at the time of the failure, for example whether its connections to other systems were also having problems;
2) Check the TM log to see whether it gives a more specific reason;
3) Check the load on the node during the period in which the timeout occurred;
4) Confirm whether something such as a Full GC caused the JVM process to be stuck at that time.
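For step 4, a minimal sketch of how one might scan a TM's GC log for Full GC pauses long enough to miss a heartbeat. It assumes GC logging is enabled on the TM with Java 8 style `-XX:+PrintGCDetails` output; the function name, the sample log lines, and the 50-second threshold are all illustrative assumptions (I believe 50 s is the default `heartbeat.timeout`, but please check the configuration docs for your Flink version).

```python
import re

# Matches the wall-clock "real=<seconds> secs" field of a Full GC line,
# as printed by Java 8 with -XX:+PrintGCDetails (assumed log format).
FULL_GC_REAL = re.compile(r"\[Full GC.*real=(\d+(?:\.\d+)?) secs\]")


def long_full_gc_pauses(lines, threshold_secs=50.0):
    """Return (line, pause_secs) for every Full GC pause >= threshold_secs."""
    hits = []
    for line in lines:
        match = FULL_GC_REAL.search(line)
        if match and float(match.group(1)) >= threshold_secs:
            hits.append((line.strip(), float(match.group(1))))
    return hits


# Made-up sample lines for illustration only; they are not from the real job.
SAMPLE = [
    "2018-08-27T14:50:01.000+0000: 100.000: [GC (Allocation Failure) "
    "512K->256K(1024K), 0.0100000 secs] "
    "[Times: user=0.01 sys=0.00, real=0.01 secs]",
    "2018-08-27T14:52:00.000+0000: 219.000: [Full GC (Ergonomics) "
    "900K->700K(1024K), 2.2500000 secs] "
    "[Times: user=2.00 sys=0.10, real=2.25 secs]",
    "2018-08-27T14:55:00.000+0000: 399.000: [Full GC (Ergonomics) "
    "900K->800K(1024K), 61.5000000 secs] "
    "[Times: user=60.00 sys=0.50, real=61.50 secs]",
]

if __name__ == "__main__":
    for line, secs in long_full_gc_pauses(SAMPLE):
        print(f"{secs:.1f}s pause: {line}")
```

In practice you would pass the lines of the real GC log file instead of `SAMPLE`, and compare the timestamps of any hits with the time of the heartbeat timeout in the JM log.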
Also, I don't know whether you are using the default heartbeat timeout; if so, you can increase it appropriately.

Thanks, vino.

Averell <lvhu...@gmail.com> wrote on Mon, Aug 27, 2018 at 3:00 PM:

> Thank you Vino.
>
> I put the message in a tag, and I don't know why it was not shown in the
> email thread. I paste the error message below in this email.
>
> Anyway, it seems that was an issue with enabling checkpointing. Now I am
> able to get it turned on properly, and my job is getting restored
> automatically.
> I am trying to test my scenarios now. Found some issues, and I think it
> would be better to ask in a separate thread.
>
> Thanks and regards,
> Averell
>
> =====
> org.apache.flink.client.program.ProgramInvocationException: Job failed. (JobID: 457d8f370ef8a50bb462946e1f12b80e)
>     at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:267)
>     at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:487)
>     at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
>     at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:661)
>     ......
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:529)
>     at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:421)
>     at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427)
>     at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:804)
>     at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:280)
>     at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
>     at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1044)
>     at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1120)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>     at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>     at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
> Caused by: java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id container_1535279282999_0032_01_000013 timed out.
>     at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1610)
>     at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:339)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at org.apache.flink.runtime.concurrent.akka.ActorSystemScheduledExecutorAdapter$ScheduledFutureTask.run(ActorSystemScheduledExecutorAdapter.java:154)
>     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/