Sure, it is already merged as FLINK-10816.

Best,
Stefan
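[Editor's note: the fix merged as FLINK-10816 boils down to the point Stefan makes in the quoted thread below — a wrapper serializer's duplicate() must also duplicate the inner element serializer it was constructed with. A self-contained sketch of the pattern; the types below (`Serializer`, `StatefulSerializer`, `WrapperSerializer`) are simplified hypothetical stand-ins, not the actual Flink source.]

```java
// Illustrative stand-in types, not the actual Flink API: the point is that
// a wrapping serializer's duplicate() must also duplicate the inner
// serializer, otherwise a stateful (e.g. Kryo-backed) inner serializer
// ends up shared across threads.
interface Serializer<T> {
    byte[] serialize(T value);

    // Must return a deep, thread-confined copy if the serializer is stateful.
    Serializer<T> duplicate();
}

// A stateful inner serializer, Kryo-like in that each call mutates
// per-instance bookkeeping that is not thread-safe.
class StatefulSerializer implements Serializer<String> {
    final java.util.List<String> referenceTable = new java.util.ArrayList<>();

    public byte[] serialize(String value) {
        referenceTable.add(value); // per-instance mutable state
        return value.getBytes(java.nio.charset.StandardCharsets.UTF_8);
    }

    public Serializer<String> duplicate() {
        return new StatefulSerializer(); // fresh state for the new thread
    }
}

// The wrapper, analogous in spirit to LockableTypeSerializer.
class WrapperSerializer implements Serializer<String> {
    final Serializer<String> element;

    WrapperSerializer(Serializer<String> element) {
        this.element = element;
    }

    public byte[] serialize(String value) {
        return element.serialize(value);
    }

    public Serializer<String> duplicate() {
        // Buggy variant: return new WrapperSerializer(element);
        // That would share the stateful inner instance between threads.
        return new WrapperSerializer(element.duplicate());
    }
}

public class DuplicateDemo {
    public static void main(String[] args) {
        StatefulSerializer inner = new StatefulSerializer();
        WrapperSerializer original = new WrapperSerializer(inner);
        WrapperSerializer copy = (WrapperSerializer) original.duplicate();
        // With the deep duplicate, the copy no longer shares mutable state.
        System.out.println(copy.element == inner); // prints: false
    }
}
```

A stateless inner serializer can safely return itself from duplicate(), which is why a shallow wrapper-level duplicate() can go unnoticed until a stateful serializer such as a Kryo-backed one is plugged in.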
> On 8. Nov 2018, at 11:53, Shailesh Jain <shailesh.j...@stellapps.com> wrote:
> 
> Thanks a lot for looking into this issue, Stefan.
> 
> Could you please let me know the issue ID once you open it? It'll help me
> understand the problem better, and I could also do a quick test in our
> environment once the issue is resolved.
> 
> Thanks,
> Shailesh
> 
> On Wed, Nov 7, 2018, 10:46 PM Till Rohrmann <trohrm...@apache.org> wrote:
> Really good finding, Stefan!
> 
> On Wed, Nov 7, 2018 at 5:28 PM Stefan Richter <s.rich...@data-artisans.com> wrote:
> Hi,
> 
> I think I can already spot the problem: LockableTypeSerializer.duplicate()
> is not properly implemented, because it also has to call duplicate() on the
> element serializer that is passed into the constructor of the new instance.
> I will open an issue and fix the problem.
> 
> Best,
> Stefan
> 
>> On 7. Nov 2018, at 17:17, Till Rohrmann <trohrm...@apache.org> wrote:
>> 
>> Hi Shailesh,
>> 
>> Could you maybe provide us with an example program which is able to
>> reproduce this problem? This would help the community debug it. It does
>> not look right and might point towards a bug in Flink. Thanks a lot!
>> 
>> Cheers,
>> Till
>> 
>> On Tue, Oct 30, 2018 at 9:10 AM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>> This looks like a problem with serializing your events using Kryo. I'm
>> adding Gordon to cc, as he was recently working with serializers. He might
>> give you more insight into what is going wrong.
>> 
>> Best,
>> 
>> Dawid
>> 
>> On 25/10/2018 05:41, Shailesh Jain wrote:
>>> Hi Dawid,
>>> 
>>> I've upgraded to Flink 1.6.1 and rebased my changes against the tag 1.6.1;
>>> the only commit on top of 1.6 is this one:
>>> https://github.com/jainshailesh/flink/commit/797e3c4af5b28263fd98fb79daaba97cabf3392c
>>> 
>>> I ran two separate identical jobs (with and without checkpointing
>>> enabled). I'm hitting an ArrayIndexOutOfBoundsException (and sometimes an
>>> NPE) only when checkpointing (HDFS backend) is enabled, with the stack
>>> trace below.
>>> 
>>> I did see a similar problem with different operators here:
>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/IndexOutOfBoundsException-on-deserialization-after-updating-to-1-6-1-td23933.html
>>> Is this a known issue that is being addressed?
>>> 
>>> Any ideas on what could be causing this?
>>> 
>>> Thanks,
>>> Shailesh
>>> 
>>> 
>>> 2018-10-24 17:04:13,365 INFO org.apache.flink.runtime.taskmanager.Task -
>>> SelectCepOperatorMixedTime (1/1) - SelectCepOperatorMixedTime (1/1)
>>> (3d984b7919342a3886593401088ca2cd) switched from RUNNING to FAILED.
>>> org.apache.flink.util.FlinkRuntimeException: Failure happened in filter function.
>>> at org.apache.flink.cep.nfa.NFA.createDecisionGraph(NFA.java:731)
>>> at org.apache.flink.cep.nfa.NFA.computeNextStates(NFA.java:541)
>>> at org.apache.flink.cep.nfa.NFA.doProcess(NFA.java:284)
>>> at org.apache.flink.cep.nfa.NFA.process(NFA.java:220)
>>> at org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.processEvent(AbstractKeyedCEPPatternOperator.java:379)
>>> at org.apache.flink.cep.operator.AbstractKeyedCEPPatternMixedTimeApproachOperator.processElement(AbstractKeyedCEPPatternMixedTimeApproachOperator.java:45)
>>> at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
>>> at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
>>> at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
>>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
>>> at java.lang.Thread.run(Thread.java:748)
>>> Caused by: org.apache.flink.util.WrappingRuntimeException: java.lang.ArrayIndexOutOfBoundsException: -1
>>> at org.apache.flink.cep.nfa.sharedbuffer.SharedBuffer.lambda$materializeMatch$1(SharedBuffer.java:305)
>>> at java.util.HashMap.computeIfAbsent(HashMap.java:1127)
>>> at org.apache.flink.cep.nfa.sharedbuffer.SharedBuffer.materializeMatch(SharedBuffer.java:301)
>>> at org.apache.flink.cep.nfa.sharedbuffer.SharedBuffer.materializeMatch(SharedBuffer.java:291)
>>> at org.apache.flink.cep.nfa.NFA$ConditionContext.getEventsForPattern(NFA.java:811)
>>> at com.stellapps.contrakcep.flink.patterns.critical.AgitatorMalfunctionAfterChillingPattern$1.filter(AgitatorMalfunctionAfterChillingPattern.java:70)
>>> at com.stellapps.contrakcep.flink.patterns.critical.AgitatorMalfunctionAfterChillingPattern$1.filter(AgitatorMalfunctionAfterChillingPattern.java:62)
>>> at org.apache.flink.cep.nfa.NFA.checkFilterCondition(NFA.java:742)
>>> at org.apache.flink.cep.nfa.NFA.createDecisionGraph(NFA.java:716)
>>> ... 10 more
>>> Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
>>> at com.esotericsoftware.kryo.util.IntArray.add(IntArray.java:61)
>>> at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:800)
>>> at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:655)
>>> at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.copy(KryoSerializer.java:231)
>>> at org.apache.flink.cep.nfa.sharedbuffer.Lockable$LockableTypeSerializer.copy(Lockable.java:120)
>>> at org.apache.flink.cep.nfa.sharedbuffer.Lockable$LockableTypeSerializer.copy(Lockable.java:95)
>>> at org.apache.flink.api.common.typeutils.base.MapSerializer.copy(MapSerializer.java:113)
>>> at org.apache.flink.api.common.typeutils.base.MapSerializer.copy(MapSerializer.java:49)
>>> at org.apache.flink.runtime.state.heap.CopyOnWriteStateTable.get(CopyOnWriteStateTable.java:287)
>>> at org.apache.flink.runtime.state.heap.CopyOnWriteStateTable.get(CopyOnWriteStateTable.java:311)
>>> at org.apache.flink.runtime.state.heap.HeapMapState.get(HeapMapState.java:85)
>>> at org.apache.flink.runtime.state.UserFacingMapState.get(UserFacingMapState.java:47)
>>> at org.apache.flink.cep.nfa.sharedbuffer.SharedBuffer.lambda$materializeMatch$1(SharedBuffer.java:303)
>>> ... 18 more
>>> 
>>> On Fri, Sep 28, 2018 at 11:00 AM Shailesh Jain <shailesh.j...@stellapps.com> wrote:
>>> Hi Dawid,
>>> 
>>> Thanks for your time on this. The diff should have pointed out only the
>>> top 3 commits, but since it did not, it is possible I did not rebase my
>>> branch against 1.4.2 correctly. I'll check this out and get back to you
>>> if I hit the same issue again.
>>> 
>>> Thanks again,
>>> Shailesh
>>> 
>>> On Thu, Sep 27, 2018 at 1:00 PM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>>> Hi Shailesh,
>>> 
>>> I am afraid it is going to be hard to help you, as this branch differs
>>> significantly from the 1.4.2 release (I've run a diff between your branch
>>> and tag/release-1.4.2). Moreover, the code in the branch you've provided
>>> still does not correspond to the lines in the exception you posted
>>> previously. Could you check if the problem occurs on vanilla Flink as well?
>>> 
>>> Best,
>>> 
>>> Dawid
>>> 
>>> 
>>> On 27/09/18 08:22, Shailesh Jain wrote:
>>>> Hi Dawid,
>>>> 
>>>> Yes, it is version 1.4.2. We are running vanilla Flink, but have added a
>>>> couple of changes to the CEP operator specifically (top 3 commits here:
>>>> https://github.com/jainshailesh/flink/commits/poc_on_1.4.2). The changes
>>>> I've made to the CEP operators do not touch the checkpointing path; they
>>>> just overload the operator for a specific way of handling event time.
>>>> 
>>>> We are hitting this in production, so I'm not sure it'll be feasible to
>>>> move to 1.6.0 immediately, but eventually yes.
>>>> 
>>>> Thanks,
>>>> Shailesh
>>>> 
>>>> On Wed, Sep 26, 2018 at 5:44 PM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>>>> Hi Shailesh,
>>>> 
>>>> Are you sure you are using version 1.4.2? Do you run vanilla Flink, or
>>>> have you introduced some changes? I am asking because the lines in the
>>>> stack trace do not align with the source code of 1.4.2.
>>>> 
>>>> Also, it is a different exception than the one in the issue you've
>>>> linked, so if it is a problem, then it is definitely a different one.
>>>> Lastly, I would recommend upgrading to the newest version, as we rewrote
>>>> the SharedBuffer implementation in 1.6.0.
>>>> 
>>>> Best,
>>>> 
>>>> Dawid
>>>> 
>>>> 
>>>> On 26/09/18 13:50, Shailesh Jain wrote:
>>>>> Hi,
>>>>> 
>>>>> I think I've hit this same issue on a 3-node standalone cluster (1.4.2)
>>>>> using HDFS (2.8.4) as the state backend.
>>>>> 
>>>>> 2018-09-26 17:07:39,370 INFO org.apache.flink.runtime.taskmanager.Task -
>>>>> Attempting to fail task externally SelectCepOperator (1/1) (3bec4aa1ef2226c4e0c5ff7b3860d340).
>>>>> 2018-09-26 17:07:39,370 INFO org.apache.flink.runtime.taskmanager.Task -
>>>>> SelectCepOperator (1/1) (3bec4aa1ef2226c4e0c5ff7b3860d340) switched from RUNNING to FAILED.
>>>>> AsynchronousException{java.lang.Exception: Could not materialize checkpoint 6 for operator SelectCepOperator (1/1).}
>>>>> at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:948)
>>>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>> at java.lang.Thread.run(Thread.java:748)
>>>>> Caused by: java.lang.Exception: Could not materialize checkpoint 6 for operator SelectCepOperator (1/1).
>>>>> ... 6 more
>>>>> Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException
>>>>> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>>>> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>>>> at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
>>>>> at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:894)
>>>>> ... 5 more
>>>>> Suppressed: java.lang.Exception: Could not properly cancel managed keyed state future.
>>>>> at org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:91)
>>>>> at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.cleanup(StreamTask.java:976)
>>>>> at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:939)
>>>>> ... 5 more
>>>>> Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException
>>>>> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>>>> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>>>> at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
>>>>> at org.apache.flink.runtime.state.StateUtil.discardStateFuture(StateUtil.java:66)
>>>>> at org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:89)
>>>>> ... 7 more
>>>>> Caused by: java.lang.NullPointerException
>>>>> at org.apache.flink.cep.nfa.SharedBuffer$SharedBufferSerializer.serialize(SharedBuffer.java:954)
>>>>> at org.apache.flink.cep.nfa.SharedBuffer$SharedBufferSerializer.serialize(SharedBuffer.java:825)
>>>>> at org.apache.flink.cep.nfa.NFA$NFASerializer.serialize(NFA.java:888)
>>>>> at org.apache.flink.cep.nfa.NFA$NFASerializer.serialize(NFA.java:820)
>>>>> at org.apache.flink.runtime.state.heap.CopyOnWriteStateTableSnapshot.writeMappingsInKeyGroup(CopyOnWriteStateTableSnapshot.java:196)
>>>>> at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$1.performOperation(HeapKeyedStateBackend.java:390)
>>>>> at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$1.performOperation(HeapKeyedStateBackend.java:339)
>>>>> at org.apache.flink.runtime.io.async.AbstractAsyncCallableWithResources.call(AbstractAsyncCallableWithResources.java:75)
>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>> at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:40)
>>>>> at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:894)
>>>>> ... 5 more
>>>>> [CIRCULAR REFERENCE: java.lang.NullPointerException]
>>>>> 
>>>>> Any ideas on why I'm hitting this, especially when
>>>>> https://issues.apache.org/jira/browse/FLINK-7756 says it has been fixed
>>>>> in 1.4.2?
>>>>> 
>>>>> On Sat, Nov 4, 2017 at 12:34 AM Federico D'Ambrosio <federico.dambro...@smartlab.ws> wrote:
>>>>> Thank you very much for your speedy response, Kostas!
>>>>> 
>>>>> Cheers,
>>>>> Federico
>>>>> 
>>>>> 2017-11-03 16:26 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com>:
>>>>> Hi Federico,
>>>>> 
>>>>> Thanks for trying it out!
>>>>> Great to hear that your problem was fixed!
>>>>> 
>>>>> The feature freeze for the release is going to be next week, and I would
>>>>> expect 1 or 2 more weeks of testing. So I would say in 2.5 weeks, but
>>>>> this is of course subject to potential issues we may find during testing.
>>>>> 
>>>>> Cheers,
>>>>> Kostas
>>>>> 
>>>>>> On Nov 3, 2017, at 4:22 PM, Federico D'Ambrosio <federico.dambro...@smartlab.ws> wrote:
>>>>>> 
>>>>>> Hi Kostas,
>>>>>> 
>>>>>> I just tried running the same job with 1.4-SNAPSHOT for 10 minutes and
>>>>>> it didn't crash, so it was the same underlying issue as the JIRA you
>>>>>> linked.
>>>>>> 
>>>>>> Do you happen to know when the 1.4 stable release is expected?
>>>>>> 
>>>>>> Thank you very much,
>>>>>> Federico
>>>>>> 
>>>>>> 2017-11-03 15:25 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com>:
>>>>>> Perfect! Thanks a lot!
>>>>>> 
>>>>>> Kostas
>>>>>> 
>>>>>>> On Nov 3, 2017, at 3:23 PM, Federico D'Ambrosio <federico.dambro...@smartlab.ws> wrote:
>>>>>>> 
>>>>>>> Hi Kostas,
>>>>>>> 
>>>>>>> Yes, I'm using 1.3.2. I'll try the current master and get back to you.
>>>>>>> 
>>>>>>> 2017-11-03 15:21 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com>:
>>>>>>> Hi Federico,
>>>>>>> 
>>>>>>> I assume that you are using Flink 1.3, right?
>>>>>>> 
>>>>>>> In this case, in 1.4 we have fixed a bug that seems similar to your case:
>>>>>>> https://issues.apache.org/jira/browse/FLINK-7756
>>>>>>> 
>>>>>>> Could you try the current master to see if it fixes your problem?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Kostas
>>>>>>> 
>>>>>>>> On Nov 3, 2017, at 3:12 PM, Federico D'Ambrosio <federico.dambro...@smartlab.ws> wrote:
>>>>>>>> 
>>>>>>>> Could not find id for entry:
>>>>>>>> 
>>>>>>> 
>>>>>>> -- 
>>>>>>> Federico D'Ambrosio
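[Editor's note: a closing observation on why the errors in this thread surface only with checkpointing enabled. Flink's asynchronous snapshots write state out on a dedicated thread while the task thread keeps processing, so a stateful serializer that is shared instead of duplicated ends up mutated from two threads at once. The sketch below shows the safe hand-off pattern; `CountingSerializer` and `SnapshotDemo` are hypothetical stand-ins, not Flink classes.]

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a stateful serializer (not a Flink class).
class CountingSerializer {
    private long calls; // per-instance mutable state, not thread-safe

    byte[] serialize(String value) {
        calls++;
        return value.getBytes(java.nio.charset.StandardCharsets.UTF_8);
    }

    CountingSerializer duplicate() {
        return new CountingSerializer(); // thread-confined copy
    }
}

public class SnapshotDemo {
    public static void main(String[] args) throws Exception {
        CountingSerializer taskThreadSerializer = new CountingSerializer();
        ExecutorService snapshotPool = Executors.newSingleThreadExecutor();

        // Safe pattern: duplicate on the task thread and hand only the
        // copy to the snapshot thread; the two never share mutable state.
        CountingSerializer snapshotCopy = taskThreadSerializer.duplicate();
        Future<byte[]> snapshot =
                snapshotPool.submit(() -> snapshotCopy.serialize("state"));

        // The task thread keeps using its own instance concurrently.
        byte[] live = taskThreadSerializer.serialize("event");

        System.out.println(snapshot.get().length + " " + live.length); // prints: 5 5
        snapshotPool.shutdown();
    }
}
```

This is also why jobs without checkpointing appeared healthy in the thread: with a single thread touching the serializer, a missing deep duplicate() never gets exercised.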