This is the exception before the job went into the cancelled state. But when I
looked at the task manager node, the Flink process was still running.

java.lang.Exception: TaskManager was lost/killed: 383f6af3299793ba73eeb7bdbab0ddc7 @ ip-xx.xx.xxx.xx.us-west-2.compute.internal (dataPort=37652)
        at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
        at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
        at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
        at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
        at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
        at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1202)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1105)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
        at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
        at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
        at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
        at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:118)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
        at akka.actor.ActorCell.invoke(ActorCell.scala:486)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
        at akka.dispatch.Mailbox.run(Mailbox.scala:220)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

On Fri, Mar 10, 2017 at 5:40 AM, Robert Metzger <rmetz...@apache.org> wrote:

> Hi,
>
> this error is only logged at WARN level. As Kaibo already said, it's not a
> critical issue.
>
> Can you send some more messages from your log? Usually the JobManager logs
> why a TaskManager has failed. And the last few log messages of the failed
> TM itself are also often helpful.
>
>
>
> On Fri, Mar 10, 2017 at 10:46 AM, Kaibo Zhou <zkb...@gmail.com> wrote:
>
>> I think this is not the root cause of the job failure; this task failure
>> was caused by other tasks failing. You can check the log of the first
>> failed task.
>>
>> 2017-03-10 12:25 GMT+08:00 Govindarajan Srinivasaraghavan <
>> govindragh...@gmail.com>:
>>
>>> Hi All,
>>>
>>> I see the below error after running my streaming job for a while, when
>>> the load increases. After a while the task manager becomes completely
>>> dead and the job keeps restarting.
>>>
>>> Also, when I checked for back pressure in the UI, it kept saying
>>> sampling in progress and no results were displayed. Is there an API
>>> which can provide the back pressure details?
>>>
>>> 2017-03-10 01:40:58,793 WARN  org.apache.flink.streaming.api.operators.AbstractStreamOperator  - Error while emitting latency marker.
>>> org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: Could not forward element to next operator
>>>         at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:426)
>>>         at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitLatencyMarker(AbstractStreamOperator.java:848)
>>>         at org.apache.flink.streaming.api.operators.StreamSource$LatencyMarksEmitter$1.onProcessingTime(StreamSource.java:152)
>>>         at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$RepeatedTriggerTask.run(SystemProcessingTimeService.java:256)
>>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>         at java.lang.Thread.run(Thread.java:745)
>>> Caused by: java.lang.RuntimeException
>>>         at org.apache.flink.streaming.runtime.io.RecordWriterOutput.emitLatencyMarker(RecordWriterOutput.java:117)
>>>         at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitLatencyMarker(AbstractStreamOperator.java:848)
>>>         at org.apache.flink.streaming.api.operators.AbstractStreamOperator.reportOrForwardLatencyMarker(AbstractStreamOperator.java:708)
>>>         at org.apache.flink.streaming.api.operators.AbstractStreamOperator.processLatencyMarker(AbstractStreamOperator.java:690)
>>>         at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:423)
>>>         ... 10 more
>>> Caused by: java.lang.InterruptedException
>>>         at java.lang.Object.wait(Native Method)
>>>         at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBuffer(LocalBufferPool.java:168)
>>>         at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBlocking(LocalBufferPool.java:138)
>>>         at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:132)
>>>         at org.apache.flink.runtime.io.network.api.writer.RecordWriter.randomEmit(RecordWriter.java:107)
>>>         at org.apache.flink.streaming.runtime.io.StreamRecordWriter.randomEmit(StreamRecordWriter.java:104)
>>>         at org.apache.flink.streaming.runtime.io.RecordWriterOutput.emitLatencyMarker(RecordWriterOutput.java:114)
>>>         ... 14 more
>>>
>>>
>>>
>>
>
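Regarding the back pressure API question above: the web UI's sampling is also exposed through the monitoring REST API, via GET /jobs/&lt;jobid&gt;/vertices/&lt;vertexid&gt;/backpressure on the JobManager. A minimal sketch of interpreting its response is below; the field names follow Flink's documented back pressure response, but the sample payload is illustrative, not captured from a real cluster:

```python
def summarize_backpressure(payload):
    """Return the worst blocked-buffer ratio across subtasks, or None
    while sampling is still in progress.

    `payload` mirrors the JSON returned by the back pressure endpoint:
    each subtask reports the "ratio" of sampled stack traces that were
    blocked requesting an output buffer.
    """
    if payload.get("status") != "ok":
        return None  # sampling in progress or statistics are stale
    ratios = [s["ratio"] for s in payload.get("subtasks", [])]
    return max(ratios, default=0.0)

# Illustrative payload (hypothetical values, shape per the REST docs):
sample = {
    "status": "ok",
    "backpressure-level": "high",
    "subtasks": [
        {"subtask": 0, "ratio": 0.62},
        {"subtask": 1, "ratio": 0.18},
    ],
}
print(summarize_backpressure(sample))  # worst ratio across subtasks: 0.62
```

Note that the first request typically triggers a fresh sample and returns a non-"ok" status until sampling completes, so polling again after the sample interval is usually needed; that may explain the "sampling in progress" message seen in the UI.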
