Hi Timo, we have a similar issue: a TM got killed by a job. Is there a way to monitor the JVM status? If it is through the monitoring metrics, which metrics should I look at? We are running Flink on K8s. Is it possible that a job consumes so much network bandwidth that the JM and TM cannot connect?
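To make the metrics question concrete, here is a rough sketch of the kind of check I have in mind. It assumes the monitoring REST API in our version exposes `/taskmanagers/<id>/metrics` and returns a JSON list of `{"id": ..., "value": ...}` entries, and that the heap metrics are named `Status.JVM.Memory.Heap.Used` and `Status.JVM.Memory.Heap.Max`. The function names, the base URL and the 0.9 threshold are placeholders of mine, not anything Flink defines.

```python
import json
from urllib.request import urlopen

# Assumed metric names from Flink's system metrics (scope Status.JVM.*).
HEAP_USED = "Status.JVM.Memory.Heap.Used"
HEAP_MAX = "Status.JVM.Memory.Heap.Max"


def parse_metrics(payload):
    """Turn a JSON list of {"id": ..., "value": ...} entries into a dict of floats."""
    return {m["id"]: float(m["value"]) for m in json.loads(payload)}


def heap_usage_ratio(metrics):
    """Return used/max heap, or None if either value is missing or max is not positive."""
    used = metrics.get(HEAP_USED)
    limit = metrics.get(HEAP_MAX)
    if used is None or limit is None or limit <= 0:
        return None
    return used / limit


def should_alert(metrics, threshold=0.9):
    """True when heap usage is known and at or above the (hypothetical) threshold."""
    ratio = heap_usage_ratio(metrics)
    return ratio is not None and ratio >= threshold


def fetch_taskmanager_metrics(base_url, taskmanager_id):
    """Query one TaskManager's heap metrics; base_url and taskmanager_id are placeholders."""
    url = "{}/taskmanagers/{}/metrics?get={},{}".format(
        base_url, taskmanager_id, HEAP_USED, HEAP_MAX
    )
    with urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Polling `fetch_taskmanager_metrics` every few seconds and logging when `should_alert` returns true would at least show whether the heap fills up before the TM disappears. It would not catch memory used outside the heap, so it is only a starting point.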
On Tue, Apr 3, 2018 at 3:11 AM Timo Walther <twal...@apache.org> wrote:

> Hi Miki,
>
> for me this sounds like your job has a resource leak such that your memory
> fills up and the JVM of the TaskManager is killed at some point. What does
> your job look like? I see a WindowedStream.apply, which might not be
> appropriate if you have big/frequent windows where the evaluation happens
> too late, such that the state becomes too big.
>
> Regards,
> Timo
>
>
> On 03.04.18 at 08:26, miki haiat wrote:
>
> I tried to run Flink on Kubernetes and as a standalone HA cluster, and in
> both cases the TaskManager got lost/killed after a few hours/days.
> I'm using Ubuntu and Flink 1.4.2.
>
> This is part of the log; I also attached the full log.
>
>> org.tlv.esb.StreamingJob$EsbTraceEvictor@20ffca60, WindowedStream.apply(WindowedStream.java:1061)) -> Sink: Unnamed (1/1) (91b27853aa30be93322d9c516ec266bf) switched from RUNNING to FAILED.
>> java.lang.Exception: TaskManager was lost/killed: 6dc6cd5c15588b49da39a31b6480b2e3 @ beam2 (dataPort=42587)
>>     at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
>>     at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:523)
>>     at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
>>     at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
>>     at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
>>     at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1198)
>>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1096)
>>     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>     at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>>     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>     at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>     at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>     at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>>     at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)
>>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>     at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>>     at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:374)
>>     at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:511)
>>     at akka.actor.ActorCell.invoke(ActorCell.scala:494)
>>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> 2018-04-02 13:09:01,727 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Flink Streaming esb correlate msg (0db04ff29124f59a123d4743d89473ed) switched from state RUNNING to FAILING.
>> java.lang.Exception: TaskManager was lost/killed: 6dc6cd5c15588b49da39a31b6480b2e3 @ beam2 (dataPort=42587)
>>     at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
>>     at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:523)
>>     at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
>>     at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
>>     at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
>>     at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1198)
>>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1096)
>>     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>     at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>>     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>     at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>     at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>     at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>>     at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)
>>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>     at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>>     at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:374)
>>     at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:511)
>>     at akka.actor.ActorCell.invoke(ActorCell.scala:494)
>>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> 2018-04-02 13:09:01,737 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source (1/1) (a10c25c2d3de57d33828524938fcfcc2) switched from RUNNING to CANCELING.