[ https://issues.apache.org/jira/browse/FLINK-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tzu-Li (Gordon) Tai updated FLINK-9228:
---------------------------------------
    Fix Version/s: (was: 1.6.3)
                   1.6.4

> log details about task fail/task manager is shutting down
> ---------------------------------------------------------
>
> Key: FLINK-9228
> URL: https://issues.apache.org/jira/browse/FLINK-9228
> Project: Flink
> Issue Type: Improvement
> Components: Logging
> Affects Versions: 1.4.2
> Reporter: makeyang
> Assignee: makeyang
> Priority: Minor
> Fix For: 1.6.4, 1.7.2, 1.8.0
>
>
> condition:
> flink version:1.4.2
> jdk version:1.8.0.20
> linux version:3.10.0
> problem description:
> One of my task managers dropped out of the cluster. I checked its log and found the following:
> 2018-04-19 22:34:47,441 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Process (115/120) (19d0b0ce1ef3b8023b37bdfda643ef44).
> 2018-04-19 22:34:47,441 INFO org.apache.flink.runtime.taskmanager.Task - Process (115/120) (19d0b0ce1ef3b8023b37bdfda643ef44) switched from RUNNING to FAILED.
> java.lang.Exception: TaskManager is shutting down.
>     at org.apache.flink.runtime.taskmanager.TaskManager.postStop(TaskManager.scala:220)
>     at akka.actor.Actor$class.aroundPostStop(Actor.scala:515)
>     at org.apache.flink.runtime.taskmanager.TaskManager.aroundPostStop(TaskManager.scala:121)
>     at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
>     at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
>     at akka.actor.ActorCell.terminate(ActorCell.scala:374)
>     at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:467)
>     at akka.actor.ActorCell.systemInvoke(ActorCell.scala:483)
>     at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:282)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:260)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> suggestion:
> # short term suggestion:
> ## Log the reason why a task failed. Did it receive some event from the job manager? Could it not connect to the job manager? Was there an operator exception? The clearer the better.
> ## Log the reason why the task manager is shutting down. Did it receive some event from the job manager? Could it not connect to the job manager? Was there an operator exception that could not be recovered from?
> # long term suggestion:
> ## Define the state machine of a Flink node clearly. If nothing happens, a node should stay in its current state: a node that is processing events should keep processing events. In other words, if its state changes from processing events to canceled, then some event must have happened.
> ## Define clearly the events which can cause a node's state to change.
> For example: user cancel, operator exception, heartbeat timeout, etc.
> ## Log each state change, and the event which caused it, clearly in the logs.
> ## Show event details (time, node, event, state change, etc.) in the web UI.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
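
The long term suggestion in the issue (state changes only on a named event, and every change is logged together with its cause) could be sketched roughly as below. This is only an illustration under stated assumptions, not Flink's actual implementation: the class `NodeStateMachine`, its `State` and `Event` enums, and the `Transition` record are all hypothetical names invented for this sketch, and `System.out` stands in for a real logger.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of an event-driven node state machine; not a Flink class. */
public class NodeStateMachine {

    /** Assumed node states, chosen for illustration only. */
    enum State { RUNNING, CANCELED, FAILED }

    /** Assumed events, taken from the examples named in the issue. */
    enum Event { USER_CANCEL, OPERATOR_EXCEPTION, HEARTBEAT_TIMEOUT }

    /** One record per event: time, node, event, and the state before and after. */
    static final class Transition {
        final Instant time;
        final String node;
        final Event event;
        final State from;
        final State to;
        final String detail;

        Transition(Instant time, String node, Event event, State from, State to, String detail) {
            this.time = time;
            this.node = node;
            this.event = event;
            this.from = from;
            this.to = to;
            this.detail = detail;
        }

        @Override
        public String toString() {
            return String.format("%s node=%s switched from %s to %s, caused by %s (%s)",
                    time, node, from, to, event, detail);
        }
    }

    private final String node;
    private State state = State.RUNNING;
    private final List<Transition> history = new ArrayList<>();

    NodeStateMachine(String node) {
        this.node = node;
    }

    State state() {
        return state;
    }

    /** Read-only view of recorded state changes, e.g. for display in a UI. */
    List<Transition> history() {
        return Collections.unmodifiableList(history);
    }

    /** The only way to change state: apply an event and record the cause. */
    Transition handle(Event event, String detail) {
        State next = nextState(state, event);
        Transition t = new Transition(Instant.now(), node, event, state, next, detail);
        if (next != state) {
            history.add(t);
            System.out.println(t); // stand-in for a real logger call
        }
        state = next;
        return t;
    }

    /** Transition table: CANCELED and FAILED are terminal and ignore further events. */
    private static State nextState(State current, Event event) {
        if (current != State.RUNNING) {
            return current;
        }
        switch (event) {
            case USER_CANCEL:
                return State.CANCELED;
            case OPERATOR_EXCEPTION:
            case HEARTBEAT_TIMEOUT:
                return State.FAILED;
            default:
                throw new IllegalArgumentException("Unknown event: " + event);
        }
    }

    public static void main(String[] args) {
        NodeStateMachine tm = new NodeStateMachine("taskmanager-1");
        tm.handle(Event.HEARTBEAT_TIMEOUT, "no heartbeat from job manager");
    }
}
```

With a log line of this shape, a message such as "TaskManager is shutting down" would always be accompanied by the event that triggered it, and the recorded history could feed the web UI view proposed above.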