Thank you, Stefan. Any idea when we can expect the 1.6.3 release?

On Thu, Nov 8, 2018 at 4:28 PM Stefan Richter <s.rich...@data-artisans.com> wrote:
> Sure, it is already merged as FLINK-10816.
>
> Best,
> Stefan
>
> On 8. Nov 2018, at 11:53, Shailesh Jain <shailesh.j...@stellapps.com> wrote:
>
> Thanks a lot for looking into this issue, Stefan.
>
> Could you please let me know the issue ID once you open it? It'll help me
> understand the problem better, and I could also do a quick test in our
> environment once the issue is resolved.
>
> Thanks,
> Shailesh
>
> On Wed, Nov 7, 2018, 10:46 PM Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Really good finding, Stefan!
>>
>> On Wed, Nov 7, 2018 at 5:28 PM Stefan Richter <s.rich...@data-artisans.com> wrote:
>>
>>> Hi,
>>>
>>> I think I can already spot the problem: LockableTypeSerializer.duplicate()
>>> is not properly implemented, because it also has to call duplicate() on the
>>> element serializer that is passed into the constructor of the new instance.
>>> I will open an issue and fix the problem.
>>>
>>> Best,
>>> Stefan
>>>
>>> On 7. Nov 2018, at 17:17, Till Rohrmann <trohrm...@apache.org> wrote:
>>>
>>> Hi Shailesh,
>>>
>>> could you maybe provide us with an example program which is able to
>>> reproduce this problem? This would help the community to better debug it.
>>> It does not look right and might point towards a bug in Flink. Thanks a lot!
>>>
>>> Cheers,
>>> Till
>>>
>>> On Tue, Oct 30, 2018 at 9:10 AM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>>>
>>>> This is some problem with serializing your events using Kryo. I'm adding
>>>> Gordon to cc, as he was recently working with serializers. He might give
>>>> you more insight into what is going wrong.
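Stefan's diagnosis above can be illustrated with a small sketch: when a wrapper serializer's duplicate() hands the *same* stateful element serializer to the copy, the two instances share mutable state, which breaks as soon as the processing thread and an asynchronous checkpoint thread use them concurrently. The classes below are simplified, hypothetical stand-ins, not the actual Flink sources:

```java
// Simplified stand-ins for the classes discussed in the thread (hypothetical names).
class DuplicateSketch {

    /** Stand-in for a stateful, non-thread-safe serializer (like Flink's KryoSerializer). */
    static class StatefulSerializer {
        int scratchState; // mutable per-instance state; unsafe to share across threads

        StatefulSerializer duplicate() {
            return new StatefulSerializer(); // fresh state, safe to hand to another thread
        }
    }

    /** Wrapper analogous to LockableTypeSerializer wrapping an element serializer. */
    static class WrapperSerializer {
        final StatefulSerializer element;

        WrapperSerializer(StatefulSerializer element) {
            this.element = element;
        }

        /** The bug Stefan describes: the copy aliases the same element serializer. */
        WrapperSerializer duplicateBroken() {
            return new WrapperSerializer(element);
        }

        /** The fix: also call duplicate() on the serializer passed to the new instance. */
        WrapperSerializer duplicateFixed() {
            return new WrapperSerializer(element.duplicate());
        }
    }

    public static void main(String[] args) {
        WrapperSerializer original = new WrapperSerializer(new StatefulSerializer());
        // The broken copy still shares mutable state with the original:
        System.out.println(original.duplicateBroken().element == original.element); // prints true
        // The fixed copy owns an independent element serializer:
        System.out.println(original.duplicateFixed().element == original.element);  // prints false
    }
}
```

The actual fix in FLINK-10816 follows the second variant, per Stefan's description: the duplicated wrapper must be built from a duplicated element serializer.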
>>>>
>>>> Best,
>>>>
>>>> Dawid
>>>>
>>>> On 25/10/2018 05:41, Shailesh Jain wrote:
>>>>
>>>> Hi Dawid,
>>>>
>>>> I've upgraded to Flink 1.6.1 and rebased my changes against the tag
>>>> 1.6.1; the only commit on top of 1.6 is this:
>>>> https://github.com/jainshailesh/flink/commit/797e3c4af5b28263fd98fb79daaba97cabf3392c
>>>>
>>>> I ran two separate identical jobs (with and without checkpointing
>>>> enabled), and I'm hitting an ArrayIndexOutOfBoundsException (and sometimes
>>>> an NPE) *only when checkpointing (HDFS backend) is enabled*, with the
>>>> stack trace below.
>>>>
>>>> I did see a similar problem with different operators here (
>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/IndexOutOfBoundsException-on-deserialization-after-updating-to-1-6-1-td23933.html).
>>>> Is this a known issue which is being addressed?
>>>>
>>>> Any ideas on what could be causing this?
>>>>
>>>> Thanks,
>>>> Shailesh
>>>>
>>>> 2018-10-24 17:04:13,365 INFO  org.apache.flink.runtime.taskmanager.Task -
>>>> SelectCepOperatorMixedTime (1/1) - SelectCepOperatorMixedTime (1/1)
>>>> (3d984b7919342a3886593401088ca2cd) switched from RUNNING to FAILED.
>>>> org.apache.flink.util.FlinkRuntimeException: Failure happened in filter
>>>> function.
>>>>     at org.apache.flink.cep.nfa.NFA.createDecisionGraph(NFA.java:731)
>>>>     at org.apache.flink.cep.nfa.NFA.computeNextStates(NFA.java:541)
>>>>     at org.apache.flink.cep.nfa.NFA.doProcess(NFA.java:284)
>>>>     at org.apache.flink.cep.nfa.NFA.process(NFA.java:220)
>>>>     at org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.processEvent(AbstractKeyedCEPPatternOperator.java:379)
>>>>     at org.apache.flink.cep.operator.AbstractKeyedCEPPatternMixedTimeApproachOperator.processElement(AbstractKeyedCEPPatternMixedTimeApproachOperator.java:45)
>>>>     at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
>>>>     at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
>>>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
>>>>     at java.lang.Thread.run(Thread.java:748)
>>>> Caused by: org.apache.flink.util.WrappingRuntimeException: java.lang.ArrayIndexOutOfBoundsException: -1
>>>>     at org.apache.flink.cep.nfa.sharedbuffer.SharedBuffer.lambda$materializeMatch$1(SharedBuffer.java:305)
>>>>     at java.util.HashMap.computeIfAbsent(HashMap.java:1127)
>>>>     at org.apache.flink.cep.nfa.sharedbuffer.SharedBuffer.materializeMatch(SharedBuffer.java:301)
>>>>     at org.apache.flink.cep.nfa.sharedbuffer.SharedBuffer.materializeMatch(SharedBuffer.java:291)
>>>>     at org.apache.flink.cep.nfa.NFA$ConditionContext.getEventsForPattern(NFA.java:811)
>>>>     at com.stellapps.contrakcep.flink.patterns.critical.AgitatorMalfunctionAfterChillingPattern$1.filter(AgitatorMalfunctionAfterChillingPattern.java:70)
>>>>     at com.stellapps.contrakcep.flink.patterns.critical.AgitatorMalfunctionAfterChillingPattern$1.filter(AgitatorMalfunctionAfterChillingPattern.java:62)
>>>>     at org.apache.flink.cep.nfa.NFA.checkFilterCondition(NFA.java:742)
>>>>     at org.apache.flink.cep.nfa.NFA.createDecisionGraph(NFA.java:716)
>>>>     ... 10 more
>>>> Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
>>>>     at com.esotericsoftware.kryo.util.IntArray.add(IntArray.java:61)
>>>>     at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:800)
>>>>     at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:655)
>>>>     at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.copy(KryoSerializer.java:231)
>>>>     at org.apache.flink.cep.nfa.sharedbuffer.Lockable$LockableTypeSerializer.copy(Lockable.java:120)
>>>>     at org.apache.flink.cep.nfa.sharedbuffer.Lockable$LockableTypeSerializer.copy(Lockable.java:95)
>>>>     at org.apache.flink.api.common.typeutils.base.MapSerializer.copy(MapSerializer.java:113)
>>>>     at org.apache.flink.api.common.typeutils.base.MapSerializer.copy(MapSerializer.java:49)
>>>>     at org.apache.flink.runtime.state.heap.CopyOnWriteStateTable.get(CopyOnWriteStateTable.java:287)
>>>>     at org.apache.flink.runtime.state.heap.CopyOnWriteStateTable.get(CopyOnWriteStateTable.java:311)
>>>>     at org.apache.flink.runtime.state.heap.HeapMapState.get(HeapMapState.java:85)
>>>>     at org.apache.flink.runtime.state.UserFacingMapState.get(UserFacingMapState.java:47)
>>>>     at org.apache.flink.cep.nfa.sharedbuffer.SharedBuffer.lambda$materializeMatch$1(SharedBuffer.java:303)
>>>>     ... 18 more
>>>>
>>>> On Fri, Sep 28, 2018 at 11:00 AM Shailesh Jain <shailesh.j...@stellapps.com> wrote:
>>>>
>>>>> Hi Dawid,
>>>>>
>>>>> Thanks for your time on this. The diff should have pointed out only the
>>>>> top 3 commits, but since it did not, it is possible I did not rebase my
>>>>> branch against 1.4.2 correctly. I'll check this out and get back to you
>>>>> if I hit the same issue again.
>>>>>
>>>>> Thanks again,
>>>>> Shailesh
>>>>>
>>>>> On Thu, Sep 27, 2018 at 1:00 PM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>>>>>
>>>>>> Hi Shailesh,
>>>>>>
>>>>>> I am afraid it is gonna be hard to help you, as this branch differs
>>>>>> significantly from the 1.4.2 release (I've done a diff across your branch
>>>>>> and tag/release-1.4.2). Moreover, the code in the branch you've provided
>>>>>> still does not correspond to the lines in the exception you've posted
>>>>>> previously. Could you check if the problem occurs on vanilla Flink as well?
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Dawid
>>>>>>
>>>>>> On 27/09/18 08:22, Shailesh Jain wrote:
>>>>>>
>>>>>> Hi Dawid,
>>>>>>
>>>>>> Yes, it is version 1.4.2. We are running vanilla Flink, but have added a
>>>>>> couple of changes in the CEP operator specifically (top 3 commits here:
>>>>>> https://github.com/jainshailesh/flink/commits/poc_on_1.4.2).
>>>>>> The changes I've made to the CEP operators do not touch the checkpointing
>>>>>> path; they just overload the operator for a specific way of handling event
>>>>>> time.
>>>>>>
>>>>>> We are hitting this in production, so I'm not sure it'll be feasible to
>>>>>> move to 1.6.0 immediately, but eventually yes.
>>>>>>
>>>>>> Thanks,
>>>>>> Shailesh
>>>>>>
>>>>>> On Wed, Sep 26, 2018 at 5:44 PM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>>>>>>
>>>>>>> Hi Shailesh,
>>>>>>>
>>>>>>> Are you sure you are using version 1.4.2? Do you run vanilla Flink, or
>>>>>>> have you introduced some changes? I am asking because the lines in the
>>>>>>> stack trace do not align with the source code for 1.4.2.
>>>>>>>
>>>>>>> Also, it is a different exception than the one in the issue you've
>>>>>>> linked, so if it is a problem then it is definitely a different one.
>>>>>>> Lastly, I would recommend upgrading to the newest version, as we
>>>>>>> rewrote the SharedBuffer implementation in 1.6.0.
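The check Dawid describes above, diffing a fork branch against a release tag to see what it actually contains, can be sketched as follows. The repository here is a throwaway created by the script itself so the example is self-contained; the tag and branch names only mirror the ones mentioned in the thread:

```shell
# Build a throwaway repo standing in for a Flink fork (illustrative only).
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "base"
git tag release-1.4.2
git checkout -qb poc_on_1.4.2
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "cep change"

# Commits on the branch that are not in the tag; if the branch was rebased
# correctly, this should count only your own commits:
git rev-list --count release-1.4.2..poc_on_1.4.2   # prints 1

# Cumulative diff of the branch against the tag, the kind of diff Dawid ran:
git diff release-1.4.2 poc_on_1.4.2 --stat
```

If `rev-list --count` reports far more commits than you added, the branch was likely based on a different commit than the release tag, which matches Shailesh's suspicion about an incorrect rebase.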
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Dawid
>>>>>>>
>>>>>>> On 26/09/18 13:50, Shailesh Jain wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I think I've hit this same issue on a 3-node standalone cluster (1.4.2)
>>>>>>> using HDFS (2.8.4) as the state backend.
>>>>>>>
>>>>>>> 2018-09-26 17:07:39,370 INFO  org.apache.flink.runtime.taskmanager.Task - Attempting
>>>>>>> to fail task externally SelectCepOperator (1/1) (3bec4aa1ef2226c4e0c5ff7b3860d340).
>>>>>>> 2018-09-26 17:07:39,370 INFO  org.apache.flink.runtime.taskmanager.Task -
>>>>>>> SelectCepOperator (1/1) (3bec4aa1ef2226c4e0c5ff7b3860d340) switched from
>>>>>>> RUNNING to FAILED.
>>>>>>> AsynchronousException{java.lang.Exception: Could not materialize checkpoint 6 for operator SelectCepOperator (1/1).}
>>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:948)
>>>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>>     at java.lang.Thread.run(Thread.java:748)
>>>>>>> Caused by: java.lang.Exception: Could not materialize checkpoint 6 for operator SelectCepOperator (1/1).
>>>>>>>     ... 6 more
>>>>>>> Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException
>>>>>>>     at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>>>>>>     at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>>>>>>     at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
>>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:894)
>>>>>>>     ... 5 more
>>>>>>>     Suppressed: java.lang.Exception: Could not properly cancel managed keyed state future.
>>>>>>>         at org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:91)
>>>>>>>         at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.cleanup(StreamTask.java:976)
>>>>>>>         at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:939)
>>>>>>>         ... 5 more
>>>>>>>     Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException
>>>>>>>         at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>>>>>>         at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>>>>>>         at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
>>>>>>>         at org.apache.flink.runtime.state.StateUtil.discardStateFuture(StateUtil.java:66)
>>>>>>>         at org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:89)
>>>>>>>         ... 7 more
>>>>>>> Caused by: java.lang.NullPointerException
>>>>>>>     at org.apache.flink.cep.nfa.SharedBuffer$SharedBufferSerializer.serialize(SharedBuffer.java:954)
>>>>>>>     at org.apache.flink.cep.nfa.SharedBuffer$SharedBufferSerializer.serialize(SharedBuffer.java:825)
>>>>>>>     at org.apache.flink.cep.nfa.NFA$NFASerializer.serialize(NFA.java:888)
>>>>>>>     at org.apache.flink.cep.nfa.NFA$NFASerializer.serialize(NFA.java:820)
>>>>>>>     at org.apache.flink.runtime.state.heap.CopyOnWriteStateTableSnapshot.writeMappingsInKeyGroup(CopyOnWriteStateTableSnapshot.java:196)
>>>>>>>     at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$1.performOperation(HeapKeyedStateBackend.java:390)
>>>>>>>     at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$1.performOperation(HeapKeyedStateBackend.java:339)
>>>>>>>     at org.apache.flink.runtime.io.async.AbstractAsyncCallableWithResources.call(AbstractAsyncCallableWithResources.java:75)
>>>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>>>     at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:40)
>>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:894)
>>>>>>>     ... 5 more
>>>>>>> [CIRCULAR REFERENCE: java.lang.NullPointerException]
>>>>>>>
>>>>>>> Any ideas on why I'm hitting this, especially when this (
>>>>>>> https://issues.apache.org/jira/browse/FLINK-7756) says it has been
>>>>>>> fixed in 1.4.2?
>>>>>>>
>>>>>>> On Sat, Nov 4, 2017 at 12:34 AM Federico D'Ambrosio <federico.dambro...@smartlab.ws> wrote:
>>>>>>>
>>>>>>>> Thank you very much for your steady response, Kostas!
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Federico
>>>>>>>>
>>>>>>>> 2017-11-03 16:26 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com>:
>>>>>>>>
>>>>>>>>> Hi Federico,
>>>>>>>>>
>>>>>>>>> Thanks for trying it out!
>>>>>>>>> Great to hear that your problem was fixed!
>>>>>>>>>
>>>>>>>>> The feature freeze for the release is going to be next week, and I
>>>>>>>>> would expect 1 or 2 more weeks of testing.
>>>>>>>>> So I would say in 2.5 weeks. But this is of course subject to
>>>>>>>>> potential issues we may find during testing.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Kostas
>>>>>>>>>
>>>>>>>>> On Nov 3, 2017, at 4:22 PM, Federico D'Ambrosio <federico.dambro...@smartlab.ws> wrote:
>>>>>>>>>
>>>>>>>>> Hi Kostas,
>>>>>>>>>
>>>>>>>>> I just tried running the same job with 1.4-SNAPSHOT for 10 minutes
>>>>>>>>> and it didn't crash, so it was the same underlying issue as in the
>>>>>>>>> JIRA you linked.
>>>>>>>>>
>>>>>>>>> Do you happen to know when the 1.4 stable release is expected?
>>>>>>>>>
>>>>>>>>> Thank you very much,
>>>>>>>>> Federico
>>>>>>>>>
>>>>>>>>> 2017-11-03 15:25 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com>:
>>>>>>>>>
>>>>>>>>>> Perfect! Thanks a lot!
>>>>>>>>>>
>>>>>>>>>> Kostas
>>>>>>>>>>
>>>>>>>>>> On Nov 3, 2017, at 3:23 PM, Federico D'Ambrosio <federico.dambro...@smartlab.ws> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Kostas,
>>>>>>>>>>
>>>>>>>>>> yes, I'm using 1.3.2. I'll try the current master and get back to you.
>>>>>>>>>>
>>>>>>>>>> 2017-11-03 15:21 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com>:
>>>>>>>>>>
>>>>>>>>>>> Hi Federico,
>>>>>>>>>>>
>>>>>>>>>>> I assume that you are using Flink 1.3, right?
>>>>>>>>>>>
>>>>>>>>>>> In this case, in 1.4 we have fixed a bug that seems similar to
>>>>>>>>>>> your case:
>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-7756
>>>>>>>>>>>
>>>>>>>>>>> Could you try the current master to see if it fixes your problem?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Kostas
>>>>>>>>>>>
>>>>>>>>>>> On Nov 3, 2017, at 3:12 PM, Federico D'Ambrosio <federico.dambro...@smartlab.ws> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Could not find id for entry:
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Federico D'Ambrosio
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Federico D'Ambrosio
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Federico D'Ambrosio