Hi, I checked the code again to figure out where the problem could be.
I just wondered if I'm implementing the Evictor correctly? Full code: https://gist.github.com/miko-code/6d7010505c3cb95be122364b29057237

    public static class EsbTraceEvictor implements Evictor<EsbTrace, GlobalWindow> {

        private static final org.slf4j.Logger LOG = LoggerFactory.getLogger(EsbTraceEvictor.class);

        @Override
        public void evictBefore(Iterable<TimestampedValue<EsbTrace>> elements, int size,
                                GlobalWindow globalWindow, EvictorContext evictorContext) {
        }

        @Override
        public void evictAfter(Iterable<TimestampedValue<EsbTrace>> elements, int size,
                               GlobalWindow globalWindow, EvictorContext evictorContext) {
            // TODO: change this to the current processing time.
            // Note: getNano() returns only the nanosecond-of-second field, so it
            // cannot be used as a point in time; compare LocalDateTime values instead.
            LocalDateTime cutoff = LocalDateTime.now().minusMinutes(5);
            LOG.info("cutoff (now - 5 min): {}", cutoff);
            DateTimeFormatter format = DateTimeFormatter.ISO_DATE_TIME;
            for (Iterator<TimestampedValue<EsbTrace>> iterator = elements.iterator(); iterator.hasNext(); ) {
                TimestampedValue<EsbTrace> element = iterator.next();
                LocalDateTime end = LocalDateTime.parse(element.getValue().getEndDate(), format);
                LOG.info("element time: {}", element.getValue().getEndDate());
                // Evict elements whose end date is more than five minutes old.
                if (end.isBefore(cutoff)) {
                    iterator.remove();
                }
            }
        }
    }

On Tue, Apr 3, 2018 at 4:28 PM, Hao Sun <ha...@zendesk.com> wrote:

> Hi Timo, we have a similar issue: a TM gets killed by a job. Is there a way
> to monitor JVM status? If through the monitoring metrics, which metric should
> I watch?
> We are running Flink on K8s. Is there a possibility that a job consumes so
> much network bandwidth that the JM and TM cannot connect?
>
> On Tue, Apr 3, 2018 at 3:11 AM Timo Walther <twal...@apache.org> wrote:
>
>> Hi Miki,
>>
>> to me this sounds like your job has a resource leak, such that memory
>> fills up and the JVM of the TaskManager is killed at some point. What
>> does your job look like? I see a WindowedStream.apply, which might not be
>> appropriate if you have big/frequent windows where the evaluation happens
>> so late that the state becomes too big.
>>
>> Regards,
>> Timo
>>
>>
>> Am 03.04.18 um 08:26 schrieb miki haiat:
>>
>> I tried to run Flink on Kubernetes and as a standalone HA cluster, and in
>> both cases the TaskManager got lost/killed after a few hours/days.
>> I'm using Ubuntu and Flink 1.4.2.
>>
>> This is part of the log; I also attached the full log.
>>
>>> org.tlv.esb.StreamingJob$EsbTraceEvictor@20ffca60,
>>> WindowedStream.apply(WindowedStream.java:1061)) -> Sink: Unnamed (1/1)
>>> (91b27853aa30be93322d9c516ec266bf) switched from RUNNING to FAILED.
>>> java.lang.Exception: TaskManager was lost/killed:
>>> 6dc6cd5c15588b49da39a31b6480b2e3 @ beam2 (dataPort=42587)
>>>   at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
>>>   at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:523)
>>>   at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
>>>   at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
>>>   at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
>>>   at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1198)
>>>   at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1096)
>>>   at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>>   at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>>>   at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>>   at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>>   at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>>   at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>>   at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>>>   at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)
>>>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>>   at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>>>   at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:374)
>>>   at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:511)
>>>   at akka.actor.ActorCell.invoke(ActorCell.scala:494)
>>>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>>   at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>>   at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>> 2018-04-02 13:09:01,727 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph
>>> - Job Flink Streaming esb correlate msg (0db04ff29124f59a123d4743d89473ed)
>>> switched from state RUNNING to FAILING.
>>> java.lang.Exception: TaskManager was lost/killed:
>>> 6dc6cd5c15588b49da39a31b6480b2e3 @ beam2 (dataPort=42587)
>>>   at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
>>>   at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:523)
>>>   at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
>>>   at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
>>>   at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
>>>   at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1198)
>>>   at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1096)
>>>   at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>>   at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>>>   at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>>   at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>>   at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>>   at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>>   at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>>>   at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)
>>>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>>   at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>>>   at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:374)
>>>   at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:511)
>>>   at akka.actor.ActorCell.invoke(ActorCell.scala:494)
>>>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>>   at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>>   at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>> 2018-04-02 13:09:01,737 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph
>>> - Source: Custom Source (1/1) (a10c25c2d3de57d33828524938fcfcc2)
>>> switched from RUNNING to CANCELING.
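
[Editor's note] On the evictor at the top of the thread: `LocalDateTime.getNano()` returns only the nanosecond-of-second field (0-999,999,999), not a point in time, so comparing `getNano()` values can never tell whether one timestamp is more than five minutes older than another. Below is a minimal, self-contained sketch of the intended "older than five minutes" check using plain JDK time comparison; the `EsbTrace` and Flink types are deliberately left out, and the class and method names are illustrative only:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class TraceExpiry {

    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ISO_DATE_TIME;

    // An element is evictable when its end date lies more than five
    // minutes before the given reference time ("now").
    public static boolean isExpired(String endDate, LocalDateTime now) {
        LocalDateTime end = LocalDateTime.parse(endDate, FORMAT);
        return end.isBefore(now.minusMinutes(5));
    }

    public static void main(String[] args) {
        LocalDateTime now = LocalDateTime.parse("2018-04-02T13:09:00");
        // 13:00 is nine minutes before 13:09, so it is past the cutoff.
        System.out.println(isExpired("2018-04-02T13:00:00", now));
    }
}
```

Inside `evictAfter`, this predicate would decide whether `iterator.remove()` is called for each `TimestampedValue<EsbTrace>`.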
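
[Editor's note] On Hao Sun's question about monitoring JVM status: Flink's TaskManagers expose JVM metrics (heap usage under Status.JVM.Memory.*, garbage-collection time and count under Status.JVM.GarbageCollector.*) through the metrics system, and steadily growing heap plus rising GC time is the usual signature of the state leak Timo described. A minimal flink-conf.yaml fragment enabling the built-in JMX reporter might look like this (the port range is an example value to adjust for your cluster):

    # Register a metrics reporter named "jmx"
    metrics.reporters: jmx
    # Built-in JMX reporter shipped with Flink
    metrics.reporter.jmx.class: org.apache.flink.metrics.jmx.JMXReporter
    # Port range the per-TaskManager JMX server may bind to (example value)
    metrics.reporter.jmx.port: 8789-8799

On Kubernetes, also compare the pod's memory limit against the TM heap settings: the OOM killer terminating the container looks exactly like "TaskManager was lost/killed" from the JobManager's side.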