Hi Averell,

This problem is caused by a heartbeat timeout between the JobManager (JM)
and a TaskManager (TM). You can narrow down the cause by:
1) Checking the network status of the node at the time, for example whether
its connections to other systems were also having problems;
2) Checking the TM log for a more specific reason;
3) Checking the load on the node during the period when the timeout occurred;
4) Confirming whether something such as a Full GC caused the JVM process to
stall at that time.

Also, I don't know whether you are using the default heartbeat timeout. If
you are, you can increase it appropriately.
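
For reference, here is a minimal sketch of what raising the timeout could
look like in flink-conf.yaml. It assumes the standard heartbeat.interval and
heartbeat.timeout keys, both in milliseconds, whose defaults are 10000 and
50000 as far as I know. The value 120000 is only an illustrative guess, not
a recommendation for your cluster:

```yaml
# flink-conf.yaml (sketch; the timeout value below is an assumption, tune it for your setup)
# How often heartbeat requests are sent, in milliseconds.
heartbeat.interval: 10000
# How long a missing heartbeat is tolerated before the peer is
# considered lost, in milliseconds. Raised here from the default.
heartbeat.timeout: 120000
```

A larger timeout only hides long pauses, so it is still worth finding out
why the TM stopped responding in the first place.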

Thanks, vino.

Averell <lvhu...@gmail.com> wrote on Mon, Aug 27, 2018 at 3:00 PM:

> Thank you Vino.
>
> I put the message in a  tag, and I don't know why it was not shown in the
> email thread. I paste the error message below in this email.
>
> Anyway, it seems that was an issue with enabling checkpointing. Now I am
> able to get it turned on properly, and my job is getting restored
> automatically.
> I am trying to test my scenarios now. Found some issues, and I think it
> would be better to ask in a separate thread.
>
> Thanks and regards,
> Averell
>
> =====
> org.apache.flink.client.program.ProgramInvocationException: Job failed. (JobID: 457d8f370ef8a50bb462946e1f12b80e)
>         at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:267)
>         at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:487)
>         at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
>         at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:661)
> ......
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:529)
>         at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:421)
>         at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427)
>         at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:804)
>         at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:280)
>         at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
>         at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1044)
>         at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1120)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>         at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>         at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
> Caused by: java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id container_1535279282999_0032_01_000013 timed out.
>         at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1610)
>         at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:339)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at org.apache.flink.runtime.concurrent.akka.ActorSystemScheduledExecutorAdapter$ScheduledFutureTask.run(ActorSystemScheduledExecutorAdapter.java:154)
>         at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
