Hi,

I think I can already spot the problem: LockableTypeSerializer.duplicate() is not implemented correctly. It must also call duplicate() on the element serializer before passing it into the constructor of the new instance. I will open an issue and fix it.
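To illustrate the pattern I mean, here is a minimal sketch. It does not use Flink's actual classes: SimpleSerializer, StatefulSerializer and WrapperSerializer are hypothetical stand-ins for TypeSerializer, a stateful element serializer (such as the Kryo one in the stack trace below) and LockableTypeSerializer. The assumption is that the element serializer keeps mutable internal state, so two threads must never share one instance.

```java
public class DuplicateSketch {

    /** Hypothetical, stripped-down stand-in for a serializer interface. */
    interface SimpleSerializer<T> {
        SimpleSerializer<T> duplicate();
        T copy(T from);
    }

    /** Stateful element serializer: the scratch buffer makes it unsafe to share across threads. */
    static final class StatefulSerializer implements SimpleSerializer<String> {
        private final StringBuilder scratch = new StringBuilder();

        @Override
        public StatefulSerializer duplicate() {
            // A fresh instance with its own scratch buffer.
            return new StatefulSerializer();
        }

        @Override
        public String copy(String from) {
            scratch.setLength(0);
            scratch.append(from);
            return scratch.toString();
        }
    }

    /** Wrapper that delegates to an element serializer (illustrative only). */
    static final class WrapperSerializer<T> implements SimpleSerializer<T> {
        private final SimpleSerializer<T> elementSerializer;

        WrapperSerializer(SimpleSerializer<T> elementSerializer) {
            this.elementSerializer = elementSerializer;
        }

        SimpleSerializer<T> getElementSerializer() {
            return elementSerializer;
        }

        @Override
        public WrapperSerializer<T> duplicate() {
            // The important part: duplicate the nested serializer as well,
            // instead of handing the same instance to the new wrapper.
            return new WrapperSerializer<>(elementSerializer.duplicate());
        }

        @Override
        public T copy(T from) {
            return elementSerializer.copy(from);
        }
    }

    public static void main(String[] args) {
        WrapperSerializer<String> original = new WrapperSerializer<>(new StatefulSerializer());
        WrapperSerializer<String> duplicate = original.duplicate();

        // The two wrappers must not share the stateful element serializer.
        System.out.println(original.getElementSerializer() != duplicate.getElementSerializer());
        System.out.println(duplicate.copy("event"));
    }
}
```

If duplicate() passed elementSerializer through unchanged, the checkpointing thread and the task thread would both use the same stateful instance, which would be consistent with the failure showing up only when checkpointing is enabled.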
Best,
Stefan

> On 7. Nov 2018, at 17:17, Till Rohrmann <trohrm...@apache.org> wrote:
>
> Hi Shailesh,
>
> could you maybe provide us with an example program which is able to reproduce
> this problem? This would help the community to better debug the problem. It
> does not look right and might point towards a bug in Flink. Thanks a lot!
>
> Cheers,
> Till
>
> On Tue, Oct 30, 2018 at 9:10 AM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
> This is some problem with serializing your events using Kryo. I'm adding
> Gordon to cc, as he was recently working with serializers. He might give you
> more insight into what is going wrong.
>
> Best,
>
> Dawid
>
> On 25/10/2018 05:41, Shailesh Jain wrote:
>> Hi Dawid,
>>
>> I've upgraded to flink 1.6.1 and rebased my changes against the tag 1.6.1,
>> the only commit on top of 1.6 is this:
>> https://github.com/jainshailesh/flink/commit/797e3c4af5b28263fd98fb79daaba97cabf3392c
>>
>> I ran two separate identical jobs (with and without checkpointing enabled),
>> I'm hitting an ArrayIndexOutOfBoundsException (and sometimes NPE) only when
>> checkpointing (HDFS backend) is enabled, with the below stack trace.
>>
>> I did see a similar problem with different operators here
>> (http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/IndexOutOfBoundsException-on-deserialization-after-updating-to-1-6-1-td23933.html).
>> Is this a known issue which is getting addressed?
>>
>> Any ideas on what could be causing this?
>>
>> Thanks,
>> Shailesh
>>
>>
>> 2018-10-24 17:04:13,365 INFO org.apache.flink.runtime.taskmanager.Task - SelectCepOperatorMixedTime (1/1) - SelectCepOperatorMixedTime (1/1) (3d984b7919342a3886593401088ca2cd) switched from RUNNING to FAILED.
>> org.apache.flink.util.FlinkRuntimeException: Failure happened in filter function.
>>     at org.apache.flink.cep.nfa.NFA.createDecisionGraph(NFA.java:731)
>>     at org.apache.flink.cep.nfa.NFA.computeNextStates(NFA.java:541)
>>     at org.apache.flink.cep.nfa.NFA.doProcess(NFA.java:284)
>>     at org.apache.flink.cep.nfa.NFA.process(NFA.java:220)
>>     at org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.processEvent(AbstractKeyedCEPPatternOperator.java:379)
>>     at org.apache.flink.cep.operator.AbstractKeyedCEPPatternMixedTimeApproachOperator.processElement(AbstractKeyedCEPPatternMixedTimeApproachOperator.java:45)
>>     at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
>>     at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
>>     at java.lang.Thread.run(Thread.java:748)
>> Caused by: org.apache.flink.util.WrappingRuntimeException: java.lang.ArrayIndexOutOfBoundsException: -1
>>     at org.apache.flink.cep.nfa.sharedbuffer.SharedBuffer.lambda$materializeMatch$1(SharedBuffer.java:305)
>>     at java.util.HashMap.computeIfAbsent(HashMap.java:1127)
>>     at org.apache.flink.cep.nfa.sharedbuffer.SharedBuffer.materializeMatch(SharedBuffer.java:301)
>>     at org.apache.flink.cep.nfa.sharedbuffer.SharedBuffer.materializeMatch(SharedBuffer.java:291)
>>     at org.apache.flink.cep.nfa.NFA$ConditionContext.getEventsForPattern(NFA.java:811)
>>     at com.stellapps.contrakcep.flink.patterns.critical.AgitatorMalfunctionAfterChillingPattern$1.filter(AgitatorMalfunctionAfterChillingPattern.java:70)
>>     at com.stellapps.contrakcep.flink.patterns.critical.AgitatorMalfunctionAfterChillingPattern$1.filter(AgitatorMalfunctionAfterChillingPattern.java:62)
>>     at org.apache.flink.cep.nfa.NFA.checkFilterCondition(NFA.java:742)
>>     at org.apache.flink.cep.nfa.NFA.createDecisionGraph(NFA.java:716)
>>     ... 10 more
>> Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
>>     at com.esotericsoftware.kryo.util.IntArray.add(IntArray.java:61)
>>     at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:800)
>>     at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:655)
>>     at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.copy(KryoSerializer.java:231)
>>     at org.apache.flink.cep.nfa.sharedbuffer.Lockable$LockableTypeSerializer.copy(Lockable.java:120)
>>     at org.apache.flink.cep.nfa.sharedbuffer.Lockable$LockableTypeSerializer.copy(Lockable.java:95)
>>     at org.apache.flink.api.common.typeutils.base.MapSerializer.copy(MapSerializer.java:113)
>>     at org.apache.flink.api.common.typeutils.base.MapSerializer.copy(MapSerializer.java:49)
>>     at org.apache.flink.runtime.state.heap.CopyOnWriteStateTable.get(CopyOnWriteStateTable.java:287)
>>     at org.apache.flink.runtime.state.heap.CopyOnWriteStateTable.get(CopyOnWriteStateTable.java:311)
>>     at org.apache.flink.runtime.state.heap.HeapMapState.get(HeapMapState.java:85)
>>     at org.apache.flink.runtime.state.UserFacingMapState.get(UserFacingMapState.java:47)
>>     at org.apache.flink.cep.nfa.sharedbuffer.SharedBuffer.lambda$materializeMatch$1(SharedBuffer.java:303)
>>     ... 18 more
>>
>> On Fri, Sep 28, 2018 at 11:00 AM Shailesh Jain <shailesh.j...@stellapps.com> wrote:
>> Hi Dawid,
>>
>> Thanks for your time on this.
>> The diff should have pointed out only the top 3 commits, but since it did
>> not, it is possible I did not rebase my branch against 1.4.2 correctly.
>> I'll check this out and get back to you if I hit the same issue again.
>>
>> Thanks again,
>> Shailesh
>>
>> On Thu, Sep 27, 2018 at 1:00 PM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>> Hi Shailesh,
>>
>> I am afraid it is going to be hard to help you, as this branch differs
>> significantly from the 1.4.2 release (I've done a diff between your branch
>> and tag/release-1.4.2). Moreover, the code in the branch you've provided still
>> does not correspond to the lines in the exception you've posted previously.
>> Could you check if the problem occurs on vanilla flink as well?
>>
>> Best,
>>
>> Dawid
>>
>>
>> On 27/09/18 08:22, Shailesh Jain wrote:
>>> Hi Dawid,
>>>
>>> Yes, it is version 1.4.2. We are running vanilla flink, but have added a
>>> couple of changes in the CEP operator specifically (top 3 commits here:
>>> https://github.com/jainshailesh/flink/commits/poc_on_1.4.2). Changes I've
>>> made to CEP operators do not touch the checkpointing path, just overloading
>>> the operator for a specific way of handling event time.
>>>
>>> We are hitting this in production, so I'm not sure it'll be feasible to
>>> move to 1.6.0 immediately, but eventually yes.
>>>
>>> Thanks,
>>> Shailesh
>>>
>>> On Wed, Sep 26, 2018 at 5:44 PM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>>> Hi Shailesh,
>>>
>>> Are you sure you are using version 1.4.2? Do you run a vanilla flink, or
>>> have you introduced some changes? I am asking because the lines in the
>>> stack trace do not align with the source code for 1.4.2.
>>>
>>> Also it is a different exception than the one in the issue you've linked,
>>> so if it is a problem then it is definitely a different one.
>>> Last thing: I would recommend upgrading to the newest version, as we have
>>> rewritten the SharedBuffer implementation in 1.6.0.
>>>
>>> Best,
>>>
>>> Dawid
>>>
>>>
>>> On 26/09/18 13:50, Shailesh Jain wrote:
>>>> Hi,
>>>>
>>>> I think I've hit this same issue on a 3 node standalone cluster (1.4.2)
>>>> using HDFS (2.8.4) as state backend.
>>>>
>>>> 2018-09-26 17:07:39,370 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally SelectCepOperator (1/1) (3bec4aa1ef2226c4e0c5ff7b3860d340).
>>>> 2018-09-26 17:07:39,370 INFO org.apache.flink.runtime.taskmanager.Task - SelectCepOperator (1/1) (3bec4aa1ef2226c4e0c5ff7b3860d340) switched from RUNNING to FAILED.
>>>> AsynchronousException{java.lang.Exception: Could not materialize checkpoint 6 for operator SelectCepOperator (1/1).}
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:948)
>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>     at java.lang.Thread.run(Thread.java:748)
>>>> Caused by: java.lang.Exception: Could not materialize checkpoint 6 for operator SelectCepOperator (1/1).
>>>>     ... 6 more
>>>> Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException
>>>>     at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>>>     at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>>>     at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:894)
>>>>     ... 5 more
>>>> Suppressed: java.lang.Exception: Could not properly cancel managed keyed state future.
>>>>     at org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:91)
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.cleanup(StreamTask.java:976)
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:939)
>>>>     ... 5 more
>>>> Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException
>>>>     at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>>>     at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>>>     at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
>>>>     at org.apache.flink.runtime.state.StateUtil.discardStateFuture(StateUtil.java:66)
>>>>     at org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:89)
>>>>     ... 7 more
>>>> Caused by: java.lang.NullPointerException
>>>>     at org.apache.flink.cep.nfa.SharedBuffer$SharedBufferSerializer.serialize(SharedBuffer.java:954)
>>>>     at org.apache.flink.cep.nfa.SharedBuffer$SharedBufferSerializer.serialize(SharedBuffer.java:825)
>>>>     at org.apache.flink.cep.nfa.NFA$NFASerializer.serialize(NFA.java:888)
>>>>     at org.apache.flink.cep.nfa.NFA$NFASerializer.serialize(NFA.java:820)
>>>>     at org.apache.flink.runtime.state.heap.CopyOnWriteStateTableSnapshot.writeMappingsInKeyGroup(CopyOnWriteStateTableSnapshot.java:196)
>>>>     at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$1.performOperation(HeapKeyedStateBackend.java:390)
>>>>     at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$1.performOperation(HeapKeyedStateBackend.java:339)
>>>>     at org.apache.flink.runtime.io.async.AbstractAsyncCallableWithResources.call(AbstractAsyncCallableWithResources.java:75)
>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>     at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:40)
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:894)
>>>>     ... 5 more
>>>> [CIRCULAR REFERENCE:java.lang.NullPointerException]
>>>>
>>>> Any ideas on why I'm hitting this, especially when this
>>>> (https://issues.apache.org/jira/browse/FLINK-7756) says it has been fixed
>>>> in 1.4.2?
>>>>
>>>> On Sat, Nov 4, 2017 at 12:34 AM Federico D'Ambrosio
>>>> <federico.dambro...@smartlab.ws> wrote:
>>>> Thank you very much for your steady response, Kostas!
>>>>
>>>> Cheers,
>>>> Federico
>>>>
>>>> 2017-11-03 16:26 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com>:
>>>> Hi Federico,
>>>>
>>>> Thanks for trying it out!
>>>> Great to hear that your problem was fixed!
>>>>
>>>> The feature freeze for the release is going to be next week, and I would
>>>> expect 1 or 2 more weeks of testing.
>>>> So I would say in 2.5 weeks. But this is of course subject to potential
>>>> issues we may find during testing.
>>>>
>>>> Cheers,
>>>> Kostas
>>>>
>>>>> On Nov 3, 2017, at 4:22 PM, Federico D'Ambrosio
>>>>> <federico.dambro...@smartlab.ws> wrote:
>>>>>
>>>>> Hi Kostas,
>>>>>
>>>>> I just tried running the same job with 1.4-SNAPSHOT for 10 minutes and it
>>>>> didn't crash, so that was the same underlying issue as the JIRA you
>>>>> linked.
>>>>>
>>>>> Do you happen to know when the 1.4 stable release is expected?
>>>>>
>>>>> Thank you very much,
>>>>> Federico
>>>>>
>>>>> 2017-11-03 15:25 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com>:
>>>>> Perfect! thanks a lot!
>>>>>
>>>>> Kostas
>>>>>
>>>>>> On Nov 3, 2017, at 3:23 PM, Federico D'Ambrosio
>>>>>> <federico.dambro...@smartlab.ws> wrote:
>>>>>>
>>>>>> Hi Kostas,
>>>>>>
>>>>>> yes, I'm using 1.3.2. I'll try the current master and I'll get back to
>>>>>> you.
>>>>>>
>>>>>> 2017-11-03 15:21 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com>:
>>>>>> Hi Federico,
>>>>>>
>>>>>> I assume that you are using Flink 1.3, right?
>>>>>>
>>>>>> In this case, in 1.4 we have fixed a bug that seems similar to your case:
>>>>>> https://issues.apache.org/jira/browse/FLINK-7756
>>>>>>
>>>>>> Could you try the current master to see if it fixes your problem?
>>>>>>
>>>>>> Thanks,
>>>>>> Kostas
>>>>>>
>>>>>>> On Nov 3, 2017, at 3:12 PM, Federico D'Ambrosio
>>>>>>> <federico.dambro...@smartlab.ws> wrote:
>>>>>>>
>>>>>>> Could not find id for entry:
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Federico D'Ambrosio