Hi,

The issue might be related to garbage collection pauses during which the TM JVM cannot communicate with the JM. The metrics include statistics on memory consumption [1] and GC activity [2] that can help diagnose the problem.
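Concretely, the metrics to watch (going by the 1.4 docs in [1] and [2]) are Status.JVM.Memory.Heap.Used/Committed/Max and Status.JVM.GarbageCollector.<Collector>.Count/.Time. They expose the standard JVM management counters, so as a quick sanity check you can also read the same numbers directly inside a TaskManager JVM. A minimal, non-Flink-specific sketch (just plain java.lang.management, nothing from the Flink API):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryUsage;

    public class JvmGcCheck {
        public static void main(String[] args) {
            // Heap usage: roughly what Status.JVM.Memory.Heap.* reports
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            System.out.printf("heap used=%d MB committed=%d MB max=%d MB%n",
                    heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);

            // GC counters: roughly what Status.JVM.GarbageCollector.<Collector>.Count/.Time reports.
            // Values are cumulative since JVM start; long collection times line up with the
            // TaskManager being unable to answer heartbeats from the JobManager.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("GC %s: count=%d time=%d ms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }

Heap usage that sits close to the maximum together with steadily growing GC times would point to the resource-leak scenario Timo described below.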
Best,
Fabian

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/metrics.html#memory
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/metrics.html#garbagecollection

2018-04-04 8:30 GMT+02:00 miki haiat <miko5...@gmail.com>:
> Hi,
>
> I checked the code again to figure out where the problem could be.
> I just wondered whether I am implementing the Evictor correctly?
>
> Full code: https://gist.github.com/miko-code/6d7010505c3cb95be122364b29057237
>
> public static class EsbTraceEvictor implements Evictor<EsbTrace, GlobalWindow> {
>     org.slf4j.Logger LOG = LoggerFactory.getLogger(EsbTraceEvictor.class);
>
>     @Override
>     public void evictBefore(Iterable<TimestampedValue<EsbTrace>> iterable, int i,
>             GlobalWindow globalWindow, Evictor.EvictorContext evictorContext) {
>     }
>
>     @Override
>     public void evictAfter(Iterable<TimestampedValue<EsbTrace>> elements, int i,
>             GlobalWindow globalWindow, EvictorContext evictorContext) {
>         // change it to current process time
>         long min5min = LocalDateTime.now().minusMinutes(5).getNano();
>         LOG.info("time now -5min", min5min);
>         DateTimeFormatter format = DateTimeFormatter.ISO_DATE_TIME;
>         for (Iterator<TimestampedValue<EsbTrace>> iterator = elements.iterator(); iterator.hasNext(); ) {
>             TimestampedValue<EsbTrace> element = iterator.next();
>             LocalDateTime el = LocalDateTime.parse(element.getValue().getEndDate(), format);
>             LOG.info("element time ", element.getValue().getEndDate());
>             if (el.minusMinutes(5).getNano() <= min5min) {
>                 iterator.remove();
>             }
>         }
>     }
> }
>
> (a reworked sketch of this time comparison is included at the end of this thread)
>
> On Tue, Apr 3, 2018 at 4:28 PM, Hao Sun <ha...@zendesk.com> wrote:
>
>> Hi Timo, we have a similar issue: a TM got killed by a job. Is there a way
>> to monitor the JVM status? If it is through the monitoring metrics, which
>> metrics should I look at?
>> We are running Flink on K8s. Is it possible that a job consumes too much
>> network bandwidth, so the JM and TM cannot connect?
>>
>> On Tue, Apr 3, 2018 at 3:11 AM Timo Walther <twal...@apache.org> wrote:
>>
>>> Hi Miki,
>>>
>>> to me this sounds like your job has a resource leak such that your
>>> memory fills up and the JVM of the TaskManager is killed at some point.
>>> What does your job look like? I see a WindowedStream.apply, which might
>>> not be appropriate if you have big/frequent windows where the evaluation
>>> happens too late, such that the state becomes too big.
>>>
>>> Regards,
>>> Timo
>>>
>>> On 03.04.18 at 08:26, miki haiat wrote:
>>>
>>> I tried to run Flink on Kubernetes and as a standalone HA cluster, and in
>>> both cases the task manager got lost/killed after a few hours/days.
>>> I am using Ubuntu and Flink 1.4.2.
>>>
>>> This is part of the log; I have also attached the full log.
>>>
>>>> org.tlv.esb.StreamingJob$EsbTraceEvictor@20ffca60, WindowedStream.apply(WindowedStream.java:1061)) -> Sink: Unnamed (1/1) (91b27853aa30be93322d9c516ec266bf) switched from RUNNING to FAILED.
>>>> java.lang.Exception: TaskManager was lost/killed: 6dc6cd5c15588b49da39a31b6480b2e3 @ beam2 (dataPort=42587)
>>>>     at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
>>>>     at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:523)
>>>>     at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
>>>>     at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
>>>>     at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
>>>>     at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1198)
>>>>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1096)
>>>>     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>>>     at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>>>>     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>>>     at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>>>     at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>>>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>>>     at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>>>     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>>>>     at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)
>>>>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>>>     at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>>>>     at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:374)
>>>>     at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:511)
>>>>     at akka.actor.ActorCell.invoke(ActorCell.scala:494)
>>>>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>>>     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>>>     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>>>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>> 2018-04-02 13:09:01,727 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Flink Streaming esb correlate msg (0db04ff29124f59a123d4743d89473ed) switched from state RUNNING to FAILING.
>>>> java.lang.Exception: TaskManager was lost/killed: 6dc6cd5c15588b49da39a31b6480b2e3 @ beam2 (dataPort=42587)
>>>>     at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
>>>>     at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:523)
>>>>     at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
>>>>     at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
>>>>     at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
>>>>     at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1198)
>>>>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1096)
>>>>     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>>>     at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>>>>     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>>>     at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>>>     at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>>>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>>>     at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>>>     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>>>>     at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)
>>>>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>>>     at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>>>>     at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:374)
>>>>     at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:511)
>>>>     at akka.actor.ActorCell.invoke(ActorCell.scala:494)
>>>>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>>>     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>>>     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>>>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>> 2018-04-02 13:09:01,737 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source (1/1) (a10c25c2d3de57d33828524938fcfcc2) switched from RUNNING to CANCELING.
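Regarding the evictor code quoted above, two things are worth double-checking. LocalDateTime#getNano() only returns the nanosecond-of-second field (0-999,999,999), not a point in time, so comparing it against "now minus 5 minutes" does not do what the comment intends. Also, the SLF4J calls need "{}" placeholders for the arguments to appear in the log. Below is a minimal reworked sketch, assuming EsbTrace#getEndDate() returns an ISO-8601 timestamp string in the TaskManager's local time zone and that eviction should be based on processing time; the class and field names and the 5-minute cut-off are taken from the gist, everything else is just one possible way to write it:

    import java.time.LocalDateTime;
    import java.time.ZoneId;
    import java.time.format.DateTimeFormatter;
    import java.util.Iterator;

    import org.apache.flink.streaming.api.windowing.evictors.Evictor;
    import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
    import org.apache.flink.streaming.runtime.operators.windowing.TimestampedValue;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public static class EsbTraceEvictor implements Evictor<EsbTrace, GlobalWindow> {

        private static final Logger LOG = LoggerFactory.getLogger(EsbTraceEvictor.class);
        private static final long MAX_AGE_MS = 5 * 60 * 1000L; // keep only the last 5 minutes

        @Override
        public void evictBefore(Iterable<TimestampedValue<EsbTrace>> elements, int size,
                GlobalWindow window, Evictor.EvictorContext ctx) {
            // intentionally empty: eviction happens after the window function
        }

        @Override
        public void evictAfter(Iterable<TimestampedValue<EsbTrace>> elements, int size,
                GlobalWindow window, Evictor.EvictorContext ctx) {
            // compare epoch milliseconds instead of LocalDateTime#getNano()
            long cutoff = ctx.getCurrentProcessingTime() - MAX_AGE_MS;
            LOG.debug("evicting elements older than {}", cutoff); // note the {} placeholder
            for (Iterator<TimestampedValue<EsbTrace>> it = elements.iterator(); it.hasNext(); ) {
                TimestampedValue<EsbTrace> element = it.next();
                long endMillis = LocalDateTime
                        .parse(element.getValue().getEndDate(), DateTimeFormatter.ISO_DATE_TIME)
                        .atZone(ZoneId.systemDefault())
                        .toInstant()
                        .toEpochMilli();
                if (endMillis <= cutoff) {
                    it.remove();
                }
            }
        }
    }

If the elements already carry usable timestamps, TimestampedValue#getTimestamp() could be compared directly instead of re-parsing getEndDate(); and for a plain "keep the last 5 minutes" requirement, a time window might be simpler than a GlobalWindow plus evictor, which keeps unbounded state until evicted (relevant to the memory pressure discussed above).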