Hi Dawid, Thanks for your time on this. The diff should have pointed out only the top 3 commits, but since it did not, it is possible I did not rebase my branch against 1.4.2 correctly. I'll check this out and get back to you if I hit the same issue again.
Thanks again, Shailesh On Thu, Sep 27, 2018 at 1:00 PM Dawid Wysakowicz <dwysakow...@apache.org> wrote: > Hi Shailesh, > > I am afraid it is gonna be hard to help you, as this branch differs > significantly from 1.4.2 release (I've done diff across your branch and > tag/release-1.4.2). Moreover the code in the branch you've provided still > does not correspond to the lines in the exception you've posted previously. > Could you check if the problem occurs on vanilla flink as well? > > Best, > > Dawid > > On 27/09/18 08:22, Shailesh Jain wrote: > > Hi Dawid, > > Yes, it is version 1.4.2. We are running vanilla flink, but have added a > couple of changes in the CEP operator specifically (top 3 commits here: > https://github.com/jainshailesh/flink/commits/poc_on_1.4.2). Changes I've > made to CEP operators do not touch the checkpointing path, just overloading > the operator for a specific way of handling event time. > > We are hitting this in production, so I'm not sure it'll be feasible to > move to 1.6.0 immediately, but eventually yes. > > Thanks, > Shailesh > > On Wed, Sep 26, 2018 at 5:44 PM Dawid Wysakowicz <dwysakow...@apache.org> > wrote: > >> Hi Shailesh, >> >> Are you sure you are using version 1.4.2? Do you run a vanilla flink, or >> have you introduced some changes? I am asking cause the lines in stacktrace >> does not align with the source code for 1.4.2. >> >> Also it is a different exception than the one in the issue you've linked, >> so if it is a problem than it is definitely a different one. Last thing I >> would recommend upgrading to the newest version, as we rewritten the >> SharedBuffer implementation in 1.6.0. >> >> Best, >> >> Dawid >> >> On 26/09/18 13:50, Shailesh Jain wrote: >> >> Hi, >> >> I think I've hit this same issue on a 3 node standalone cluster (1.4.2) >> using HDFS (2.8.4) as state backend. >> >> 2018-09-26 17:07:39,370 INFO >> org.apache.flink.runtime.taskmanager.Task - Attempting >> to fail task externally SelectCepOperator (1/1) >> (3bec4aa1ef2226c4e0c5ff7b3860d340). >> 2018-09-26 17:07:39,370 INFO >> org.apache.flink.runtime.taskmanager.Task - >> SelectCepOperator (1/1) (3bec4aa1ef2226c4e0c5ff7b3860d340) switched from >> RUNNING to FAILED. >> AsynchronousException{java.lang.Exception: Could not materialize >> checkpoint 6 for operator SelectCepOperator (1/1).} >> at >> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:948) >> at >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) >> at java.util.concurrent.FutureTask.run(FutureTask.java:266) >> at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) >> at java.lang.Thread.run(Thread.java:748) >> Caused by: java.lang.Exception: Could not materialize checkpoint 6 for >> operator SelectCepOperator (1/1). >> ... 6 more >> Caused by: java.util.concurrent.ExecutionException: >> java.lang.NullPointerException >> at java.util.concurrent.FutureTask.report(FutureTask.java:122) >> at java.util.concurrent.FutureTask.get(FutureTask.java:192) >> at >> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43) >> at >> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:894) >> ... 5 more >> Suppressed: java.lang.Exception: Could not properly cancel managed >> keyed state future. >> at >> org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:91) >> at >> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.cleanup(StreamTask.java:976) >> at >> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:939) >> ... 5 more >> Caused by: java.util.concurrent.ExecutionException: >> java.lang.NullPointerException >> at java.util.concurrent.FutureTask.report(FutureTask.java:122) >> at java.util.concurrent.FutureTask.get(FutureTask.java:192) >> at >> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43) >> at >> org.apache.flink.runtime.state.StateUtil.discardStateFuture(StateUtil.java:66) >> at >> org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:89) >> ... 7 more >> Caused by: java.lang.NullPointerException >> at >> org.apache.flink.cep.nfa.SharedBuffer$SharedBufferSerializer.serialize(SharedBuffer.java:954) >> at >> org.apache.flink.cep.nfa.SharedBuffer$SharedBufferSerializer.serialize(SharedBuffer.java:825) >> at >> org.apache.flink.cep.nfa.NFA$NFASerializer.serialize(NFA.java:888) >> at >> org.apache.flink.cep.nfa.NFA$NFASerializer.serialize(NFA.java:820) >> at >> org.apache.flink.runtime.state.heap.CopyOnWriteStateTableSnapshot.writeMappingsInKeyGroup(CopyOnWriteStateTableSnapshot.java:196) >> at >> org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$1.performOperation(HeapKeyedStateBackend.java:390) >> at >> org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$1.performOperation(HeapKeyedStateBackend.java:339) >> at >> org.apache.flink.runtime.io.async.AbstractAsyncCallableWithResources.call(AbstractAsyncCallableWithResources.java:75) >> at java.util.concurrent.FutureTask.run(FutureTask.java:266) >> at >> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:40) >> at >> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:894) >> ... 5 more >> [CIRCULAR REFERENCE:java.lang.NullPointerException] >> >> Any ideas on why I'm hitting this especially when this ( >> https://issues.apache.org/jira/browse/FLINK-7756) says it has been fixed >> in 1.4.2 ? >> >> On Sat, Nov 4, 2017 at 12:34 AM Federico D'Ambrosio < >> federico.dambro...@smartlab.ws> wrote: >> >>> Thank you very much for your steady response, Kostas! >>> >>> Cheers, >>> Federico >>> >>> 2017-11-03 16:26 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com>: >>> >>>> Hi Federico, >>>> >>>> Thanks for trying it out! >>>> Great to hear that your problem was fixed! >>>> >>>> The feature freeze for the release is going to be next week, and I >>>> would expect 1 or 2 more weeks testing. >>>> So I would say in 2.5 weeks. But this is of course subject to potential >>>> issues we may find during testing. >>>> >>>> Cheers, >>>> Kostas >>>> >>>> On Nov 3, 2017, at 4:22 PM, Federico D'Ambrosio < >>>> federico.dambro...@smartlab.ws> wrote: >>>> >>>> Hi Kostas, >>>> >>>> I just tried running the same job with 1.4-SNAPSHOT for 10 minutes and >>>> it didn't crash, so that was the same underlying issue of the JIRA you >>>> linked. >>>> >>>> Do you happen to know when it's expected the 1.4 stable release? >>>> >>>> Thank you very much, >>>> Federico >>>> >>>> 2017-11-03 15:25 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com> >>>> : >>>> >>>>> Perfect! thanks a lot! >>>>> >>>>> Kostas >>>>> >>>>> On Nov 3, 2017, at 3:23 PM, Federico D'Ambrosio < >>>>> federico.dambro...@smartlab.ws> wrote: >>>>> >>>>> Hi Kostas, >>>>> >>>>> yes, I'm using 1.3.2. I'll try the current master and I'll get back to >>>>> you. >>>>> >>>>> 2017-11-03 15:21 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com >>>>> >: >>>>> >>>>>> Hi Federico, >>>>>> >>>>>> I assume that you are using Flink 1.3, right? >>>>>> >>>>>> In this case, in 1.4 we have fixed a bug that seems similar to your >>>>>> case: >>>>>> https://issues.apache.org/jira/browse/FLINK-7756 >>>>>> >>>>>> Could you try the current master to see if it fixes your problem? >>>>>> >>>>>> Thanks, >>>>>> Kostas >>>>>> >>>>>> On Nov 3, 2017, at 3:12 PM, Federico D'Ambrosio < >>>>>> federico.dambro...@smartlab.ws> wrote: >>>>>> >>>>>> Could not find id for >>>>>> entry: >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Federico D'Ambrosio >>>>> >>>>> >>>>> >>>> >>>> >>>> -- >>>> Federico D'Ambrosio >>>> >>>> >>>> >>> >>> >>> -- >>> Federico D'Ambrosio >>> >> >> >