makeyang created FLINK-9228:
-------------------------------

             Summary: log details about task fail/task manager is shutting down
                 Key: FLINK-9228
                 URL: https://issues.apache.org/jira/browse/FLINK-9228
             Project: Flink
          Issue Type: Improvement
          Components: Logging
    Affects Versions: 1.4.2
            Reporter: makeyang
            Assignee: makeyang
             Fix For: 1.4.3, 1.5.1

condition:
flink version: 1.4.2
jdk version: 1.8.0.20
linux version: 3.10.0

problem description:
One of my task managers dropped out of the cluster. I checked its log and found the following:

2018-04-19 22:34:47,441 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Process (115/120) (19d0b0ce1ef3b8023b37bdfda643ef44).
2018-04-19 22:34:47,441 INFO org.apache.flink.runtime.taskmanager.Task - Process (115/120) (19d0b0ce1ef3b8023b37bdfda643ef44) switched from RUNNING to FAILED.
java.lang.Exception: TaskManager is shutting down.
	at org.apache.flink.runtime.taskmanager.TaskManager.postStop(TaskManager.scala:220)
	at akka.actor.Actor$class.aroundPostStop(Actor.scala:515)
	at org.apache.flink.runtime.taskmanager.TaskManager.aroundPostStop(TaskManager.scala:121)
	at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
	at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
	at akka.actor.ActorCell.terminate(ActorCell.scala:374)
	at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:467)
	at akka.actor.ActorCell.systemInvoke(ActorCell.scala:483)
	at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:282)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:260)
	at akka.dispatch.Mailbox.run(Mailbox.scala:224)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

suggestion:
# short term suggestion:
## log the reason why the task failed: did it receive some event from the job manager? could it not connect to the job manager? was there an operator exception? The clearer, the better.
## log the reason why the task manager is shutting down: did it receive some event from the job manager? could it not connect to the job manager? was there an operator exception that can't be recovered from?
# long term suggestion:
## define the state machine of a flink node clearly. If nothing happens, the node should stay in its current state: if it is processing events and nothing happens, it should keep processing events. In other words, if its state changes from processing events to cancel, then some event must have happened.
## clearly define the events which can cause a node's state to change, such as user cancel, operator exception, heartbeat timeout, etc.
## clearly log each state change together with the event which caused it
## show event details (time, node, event, state change, etc.) in the web UI
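
To illustrate the long term suggestion, here is a minimal Java sketch of a node state machine in which every state change is recorded together with the event that caused it. All class, state, and event names below are hypothetical and chosen only for illustration; they are not Flink's actual classes, states, or events, and printing to standard output stands in for a real logger.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: none of these names come from Flink itself.
class NodeStateMachine {
    // Assumed node states, for illustration only.
    enum State { RUNNING, CANCELED, FAILED, SHUTTING_DOWN }

    // Assumed events that may cause a state change.
    enum Event { USER_CANCEL, OPERATOR_EXCEPTION, HEARTBEAT_TIMEOUT, JOB_MANAGER_DISCONNECTED }

    // One record per handled event: time, node, event, and the state change.
    static final class Transition {
        final Instant time;
        final String node;
        final Event event;
        final State from;
        final State to;
        final String detail;

        Transition(Instant time, String node, Event event, State from, State to, String detail) {
            this.time = time;
            this.node = node;
            this.event = event;
            this.from = from;
            this.to = to;
            this.detail = detail;
        }

        @Override
        public String toString() {
            return time + " node=" + node + " event=" + event
                    + " state " + from + " -> " + to + " (" + detail + ")";
        }
    }

    private final String node;
    private State state = State.RUNNING;
    private final List<Transition> history = new ArrayList<>();

    NodeStateMachine(String node) {
        this.node = node;
    }

    synchronized State state() {
        return state;
    }

    synchronized List<Transition> history() {
        return Collections.unmodifiableList(new ArrayList<>(history));
    }

    // The state only changes inside this method, so every change has a recorded cause.
    synchronized Transition handle(Event event, String detail) {
        State next = nextState(state, event);
        Transition t = new Transition(Instant.now(), node, event, state, next, detail);
        history.add(t);
        System.out.println(t); // stand-in for a real logger call
        state = next;
        return t;
    }

    private static State nextState(State current, Event event) {
        if (current != State.RUNNING) {
            return current; // in this sketch, non-running states ignore further events
        }
        switch (event) {
            case USER_CANCEL:
                return State.CANCELED;
            case OPERATOR_EXCEPTION:
                return State.FAILED;
            case HEARTBEAT_TIMEOUT:
            case JOB_MANAGER_DISCONNECTED:
                return State.SHUTTING_DOWN;
            default:
                throw new IllegalArgumentException("unknown event: " + event);
        }
    }

    public static void main(String[] args) {
        NodeStateMachine tm = new NodeStateMachine("taskmanager-1");
        tm.handle(Event.HEARTBEAT_TIMEOUT, "no heartbeat from job manager");
    }
}
```

With a record like this, a log line such as "TaskManager is shutting down" would always be accompanied by the event that triggered the shutdown, and the same history list could feed an event view in the web UI.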