Hi,

I checked the code again to figure out where the problem could be.

I just wondered whether I'm implementing the Evictor correctly.

Full code:
https://gist.github.com/miko-code/6d7010505c3cb95be122364b29057237




public static class EsbTraceEvictor implements Evictor<EsbTrace, GlobalWindow> {
    // static: the evictor is serialized with the job, so the logger should not be instance state
    private static final org.slf4j.Logger LOG = LoggerFactory.getLogger(EsbTraceEvictor.class);
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ISO_DATE_TIME;

    @Override
    public void evictBefore(Iterable<TimestampedValue<EsbTrace>> iterable, int i,
                            GlobalWindow globalWindow, Evictor.EvictorContext evictorContext) {
        // Nothing is evicted before the window function runs.
    }

    @Override
    public void evictAfter(Iterable<TimestampedValue<EsbTrace>> elements, int i,
                           GlobalWindow globalWindow, EvictorContext evictorContext) {
        // TODO: change it to the current processing time
        // Compare LocalDateTime values directly: getNano() returns only the
        // nano-of-second field, not an absolute timestamp.
        LocalDateTime cutoff = LocalDateTime.now().minusMinutes(5);
        LOG.info("time now -5min {}", cutoff);
        for (Iterator<TimestampedValue<EsbTrace>> iterator = elements.iterator(); iterator.hasNext(); ) {
            TimestampedValue<EsbTrace> element = iterator.next();
            LocalDateTime el = LocalDateTime.parse(element.getValue().getEndDate(), FORMAT);
            LOG.info("element time {}", element.getValue().getEndDate());
            // Assumed intent: evict elements whose end date is more than 5 minutes old.
            if (el.isBefore(cutoff)) {
                iterator.remove();
            }
        }
    }
}
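
A note on the time comparison in an evictor like this: LocalDateTime.getNano() returns only the nano-of-second field (0 to 999,999,999), not an absolute timestamp, so it cannot tell whether an element is five minutes old. Below is a minimal standalone sketch of the age check, assuming the end date is an ISO date-time string and that the intent is to drop elements older than a fixed age. The class name EvictionCutoffSketch and the method isExpired are hypothetical and are not part of the gist or of Flink.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class EvictionCutoffSketch {

    // Hypothetical helper: true when endDate (ISO date-time text) lies
    // more than maxAge before the given reference time.
    static boolean isExpired(String endDate, LocalDateTime now, Duration maxAge) {
        LocalDateTime end = LocalDateTime.parse(endDate, DateTimeFormatter.ISO_DATE_TIME);
        return end.isBefore(now.minus(maxAge));
    }

    public static void main(String[] args) {
        LocalDateTime now = LocalDateTime.of(2018, 4, 3, 16, 0, 0);
        Duration maxAge = Duration.ofMinutes(5);

        // Ten minutes old: past the five-minute cutoff, so it would be evicted.
        System.out.println(isExpired("2018-04-03T15:50:00", now, maxAge));
        // Two minutes old: still inside the cutoff, so it would be kept.
        System.out.println(isExpired("2018-04-03T15:58:00", now, maxAge));

        // getNano() ignores the date and time of day entirely: these two
        // instants are years apart but have the same nano-of-second value.
        LocalDateTime old = LocalDateTime.of(2000, 1, 1, 0, 0, 0);
        System.out.println(old.getNano() == now.getNano());
    }
}
```

Passing the reference time in as a parameter, instead of calling LocalDateTime.now() inside the helper, keeps the check deterministic and would make it easier to switch to the processing time supplied by the evictor context later.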






On Tue, Apr 3, 2018 at 4:28 PM, Hao Sun <ha...@zendesk.com> wrote:

> Hi Timo, we have a similar issue: a TM got killed by a job. Is there a way
> to monitor the JVM status? If it is through the metrics, which metric should
> I look at?
> We are running Flink on K8S. Is it possible that a job consumes too much
> network bandwidth, so that the JM and TM cannot connect?
>
> On Tue, Apr 3, 2018 at 3:11 AM Timo Walther <twal...@apache.org> wrote:
>
>> Hi Miki,
>>
>> To me this sounds like your job has a resource leak, such that memory
>> fills up and the JVM of the TaskManager is killed at some point. What
>> does your job look like? I see a WindowedStream.apply, which might not be
>> appropriate if you have big/frequent windows where the evaluation happens
>> too late, such that the state becomes too big.
>>
>> Regards,
>> Timo
>>
>>
>> On 03.04.18 at 08:26, miki haiat wrote:
>>
>> I tried to run Flink on Kubernetes and as a standalone HA cluster, and in
>> both cases the task manager got lost/killed after a few hours/days.
>> I'm using Ubuntu and Flink 1.4.2.
>>
>>
>> This is part of the log; I also attached the full log.
>>
>>>
>>> org.tlv.esb.StreamingJob$EsbTraceEvictor@20ffca60, WindowedStream.apply(WindowedStream.java:1061)) -> Sink: Unnamed (1/1) (91b27853aa30be93322d9c516ec266bf) switched from RUNNING to FAILED.
>>> java.lang.Exception: TaskManager was lost/killed: 6dc6cd5c15588b49da39a31b6480b2e3 @ beam2 (dataPort=42587)
>>> at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
>>> at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:523)
>>> at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
>>> at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
>>> at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
>>> at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1198)
>>> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1096)
>>> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>> at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>>> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>> at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>>> at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)
>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>> at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>>> at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:374)
>>> at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:511)
>>> at akka.actor.ActorCell.invoke(ActorCell.scala:494)
>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>> 2018-04-02 13:09:01,727 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Flink Streaming esb correlate msg (0db04ff29124f59a123d4743d89473ed) switched from state RUNNING to FAILING.
>>> java.lang.Exception: TaskManager was lost/killed: 6dc6cd5c15588b49da39a31b6480b2e3 @ beam2 (dataPort=42587)
>>> at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
>>> at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:523)
>>> at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
>>> at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
>>> at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
>>> at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1198)
>>> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1096)
>>> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>> at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>>> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>> at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>>> at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)
>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>> at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>>> at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:374)
>>> at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:511)
>>> at akka.actor.ActorCell.invoke(ActorCell.scala:494)
>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>> 2018-04-02 13:09:01,737 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source (1/1) (a10c25c2d3de57d33828524938fcfcc2) switched from RUNNING to CANCELING.
>>
>>
>>
>>
>>
>>
