This is the exception that was logged before the job went into the cancelled state. However, when I checked the task manager node, the Flink process was still running.

java.lang.Exception: TaskManager was lost/killed: 383f6af3299793ba73eeb7bdbab0ddc7 @ ip-xx.xx.xxx.xx.us-west-2.compute.internal (dataPort=37652)
    at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
    at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
    at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
    at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
    at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
    at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1202)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1105)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
    at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
    at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:118)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
    at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
    at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
    at akka.actor.ActorCell.invoke(ActorCell.scala:486)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

On Fri, Mar 10, 2017 at 5:40 AM, Robert Metzger <rmetz...@apache.org> wrote:

> Hi,
>
> this error is only logged at WARN level. As Kaibo already said, it's not a
> critical issue.
>
> Can you send some more messages from your log? Usually the JobManager logs
> why a TaskManager has failed. And the last few log messages of the failed
> TM itself are also often helpful.
>
> On Fri, Mar 10, 2017 at 10:46 AM, Kaibo Zhou <zkb...@gmail.com> wrote:
>
>> I think this is not the root cause of job failure, this task is caused by
>> other tasks failing. You can check the log of the first failed task.
>>
>> 2017-03-10 12:25 GMT+08:00 Govindarajan Srinivasaraghavan <
>> govindragh...@gmail.com>:
>>
>>> Hi All,
>>>
>>> I see the below error after running my streaming job for a while and
>>> when the load increases.
>>> After a while the task manager becomes completely
>>> dead and the job keeps on restarting.
>>>
>>> Also when I checked if there is any back pressure in the UI, it kept on
>>> saying sampling in progress and no results were displayed. Is there an API
>>> which can provide the back pressure details?
>>>
>>> 2017-03-10 01:40:58,793 WARN org.apache.flink.streaming.api.operators.AbstractStreamOperator - Error while emitting latency marker.
>>> org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: Could not forward element to next operator
>>>     at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:426)
>>>     at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitLatencyMarker(AbstractStreamOperator.java:848)
>>>     at org.apache.flink.streaming.api.operators.StreamSource$LatencyMarksEmitter$1.onProcessingTime(StreamSource.java:152)
>>>     at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$RepeatedTriggerTask.run(SystemProcessingTimeService.java:256)
>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>     at java.lang.Thread.run(Thread.java:745)
>>> Caused by: java.lang.RuntimeException
>>>     at org.apache.flink.streaming.runtime.io.RecordWriterOutput.emitLatencyMarker(RecordWriterOutput.java:117)
>>>     at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitLatencyMarker(AbstractStreamOperator.java:848)
>>>     at org.apache.flink.streaming.api.operators.AbstractStreamOperator.reportOrForwardLatencyMarker(AbstractStreamOperator.java:708)
>>>     at org.apache.flink.streaming.api.operators.AbstractStreamOperator.processLatencyMarker(AbstractStreamOperator.java:690)
>>>     at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:423)
>>>     ... 10 more
>>> Caused by: java.lang.InterruptedException
>>>     at java.lang.Object.wait(Native Method)
>>>     at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBuffer(LocalBufferPool.java:168)
>>>     at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBlocking(LocalBufferPool.java:138)
>>>     at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:132)
>>>     at org.apache.flink.runtime.io.network.api.writer.RecordWriter.randomEmit(RecordWriter.java:107)
>>>     at org.apache.flink.streaming.runtime.io.StreamRecordWriter.randomEmit(StreamRecordWriter.java:104)
>>>     at org.apache.flink.streaming.runtime.io.RecordWriterOutput.emitLatencyMarker(RecordWriterOutput.java:114)
>>>     ... 14 more