Re: FlinkCEP, circular references and checkpointing failures

Shailesh Jain Wed, 24 Oct 2018 20:42:31 -0700

Hi Dawid,

I've upgraded to Flink 1.6.1 and rebased my changes against the tag 1.6.1;
the only commit on top of 1.6 is this:
https://github.com/jainshailesh/flink/commit/797e3c4af5b28263fd98fb79daaba97cabf3392c


I ran two otherwise identical jobs, one with checkpointing enabled and one
without. I'm hitting an ArrayIndexOutOfBoundsException (and sometimes an NPE)
*only when checkpointing (HDFS backend) is enabled*, with the stack trace below.

I did see a similar problem with different operators here (
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/IndexOutOfBoundsException-on-deserialization-after-updating-to-1-6-1-td23933.html).
Is this a known issue which is getting addressed?

Any ideas on what could be causing this?

Thanks,
Shailesh


2018-10-24 17:04:13,365 INFO
org.apache.flink.runtime.taskmanager.Task                     -
SelectCepOperatorMixedTime (1/1) - SelectCepOperatorMixedTime (1/1)
(3d984b7919342a3886593401088ca2cd) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkRuntimeException: Failure happened in filter
function.
        at org.apache.flink.cep.nfa.NFA.createDecisionGraph(NFA.java:731)
        at org.apache.flink.cep.nfa.NFA.computeNextStates(NFA.java:541)
        at org.apache.flink.cep.nfa.NFA.doProcess(NFA.java:284)
        at org.apache.flink.cep.nfa.NFA.process(NFA.java:220)
        at
org.apache.flink.cep.operator.AbstractKeyedCEPPatternOperator.processEvent(AbstractKeyedCEPPatternOperator.java:379)
        at
org.apache.flink.cep.operator.AbstractKeyedCEPPatternMixedTimeApproachOperator.processElement(AbstractKeyedCEPPatternMixedTimeApproachOperator.java:45)
        at
org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
        at
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
        at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.util.WrappingRuntimeException:
java.lang.ArrayIndexOutOfBoundsException: -1
        at
org.apache.flink.cep.nfa.sharedbuffer.SharedBuffer.lambda$materializeMatch$1(SharedBuffer.java:305)
        at java.util.HashMap.computeIfAbsent(HashMap.java:1127)
        at
org.apache.flink.cep.nfa.sharedbuffer.SharedBuffer.materializeMatch(SharedBuffer.java:301)
        at
org.apache.flink.cep.nfa.sharedbuffer.SharedBuffer.materializeMatch(SharedBuffer.java:291)
        at
org.apache.flink.cep.nfa.NFA$ConditionContext.getEventsForPattern(NFA.java:811)
        at
com.stellapps.contrakcep.flink.patterns.critical.AgitatorMalfunctionAfterChillingPattern$1.filter(AgitatorMalfunctionAfterChillingPattern.java:70)
        at
com.stellapps.contrakcep.flink.patterns.critical.AgitatorMalfunctionAfterChillingPattern$1.filter(AgitatorMalfunctionAfterChillingPattern.java:62)
        at org.apache.flink.cep.nfa.NFA.checkFilterCondition(NFA.java:742)
        at org.apache.flink.cep.nfa.NFA.createDecisionGraph(NFA.java:716)
        ... 10 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
        at com.esotericsoftware.kryo.util.IntArray.add(IntArray.java:61)
        at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:800)
        at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:655)
        at
org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.copy(KryoSerializer.java:231)
        at
org.apache.flink.cep.nfa.sharedbuffer.Lockable$LockableTypeSerializer.copy(Lockable.java:120)
        at
org.apache.flink.cep.nfa.sharedbuffer.Lockable$LockableTypeSerializer.copy(Lockable.java:95)
        at
org.apache.flink.api.common.typeutils.base.MapSerializer.copy(MapSerializer.java:113)
        at
org.apache.flink.api.common.typeutils.base.MapSerializer.copy(MapSerializer.java:49)
        at
org.apache.flink.runtime.state.heap.CopyOnWriteStateTable.get(CopyOnWriteStateTable.java:287)
        at
org.apache.flink.runtime.state.heap.CopyOnWriteStateTable.get(CopyOnWriteStateTable.java:311)
        at
org.apache.flink.runtime.state.heap.HeapMapState.get(HeapMapState.java:85)
        at
org.apache.flink.runtime.state.UserFacingMapState.get(UserFacingMapState.java:47)
        at
org.apache.flink.cep.nfa.sharedbuffer.SharedBuffer.lambda$materializeMatch$1(SharedBuffer.java:303)
        ... 18 more
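
The trace above nests three exceptions: FlinkRuntimeException wraps
WrappingRuntimeException, which wraps the ArrayIndexOutOfBoundsException
thrown inside Kryo. When logging such failures from user code, the chain can
be unwound to the innermost cause. Below is a minimal sketch using only the
JDK; `RootCause` and `rootCauseOf` are hypothetical names chosen for
illustration and are not part of Flink. The loop also stops if the cause
chain loops back on itself.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical helper for illustration; not part of Flink or of the job in this thread.
final class RootCause {
    private RootCause() {}

    /**
     * Follows getCause() from a non-null throwable to the innermost one.
     * Throwables already visited are remembered by identity, so a cyclic
     * cause chain ends the walk instead of looping forever.
     */
    static Throwable rootCauseOf(Throwable t) {
        Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Throwable current = t;
        while (current.getCause() != null && seen.add(current)) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Mimics the shape of the trace above with plain JDK exceptions.
        Throwable nested = new RuntimeException(
                "Failure happened in filter function.",
                new RuntimeException(new ArrayIndexOutOfBoundsException("-1")));
        System.out.println(rootCauseOf(nested));
    }
}
```

In a filter condition, one might wrap the call to getEventsForPattern in a
try/catch and log `RootCause.rootCauseOf(e)` before rethrowing, so the Kryo
failure shows up directly in the task log.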

On Fri, Sep 28, 2018 at 11:00 AM Shailesh Jain <shailesh.j...@stellapps.com>
wrote:

> Hi Dawid,
>
> Thanks for your time on this. The diff should have pointed out only the
> top 3 commits, but since it did not, it is possible I did not rebase my
> branch against 1.4.2 correctly. I'll check this out and get back to you if
> I hit the same issue again.
>
> Thanks again,
> Shailesh
>
> On Thu, Sep 27, 2018 at 1:00 PM Dawid Wysakowicz <dwysakow...@apache.org>
> wrote:
>
>> Hi Shailesh,
>>
>> I am afraid it is going to be hard to help you, as this branch differs
>> significantly from the 1.4.2 release (I ran a diff between your branch and
>> tag/release-1.4.2). Moreover, the code in the branch you've provided still
>> does not correspond to the lines in the exception you've posted previously.
>> Could you check if the problem occurs on vanilla Flink as well?
>>
>> Best,
>>
>> Dawid
>>
>> On 27/09/18 08:22, Shailesh Jain wrote:
>>
>> Hi Dawid,
>>
>> Yes, it is version 1.4.2. We are running vanilla Flink, but have added a
>> couple of changes in the CEP operator specifically (top 3 commits here:
>> https://github.com/jainshailesh/flink/commits/poc_on_1.4.2). The changes
>> I've made to the CEP operators do not touch the checkpointing path; they
>> just overload the operator for a specific way of handling event time.
>>
>> We are hitting this in production, so I'm not sure it'll be feasible to
>> move to 1.6.0 immediately, but eventually yes.
>>
>> Thanks,
>> Shailesh
>>
>> On Wed, Sep 26, 2018 at 5:44 PM Dawid Wysakowicz <dwysakow...@apache.org>
>> wrote:
>>
>>> Hi Shailesh,
>>>
>>> Are you sure you are using version 1.4.2? Do you run vanilla Flink, or
>>> have you introduced some changes? I am asking because the lines in the
>>> stack trace do not align with the source code for 1.4.2.
>>>
>>> Also, it is a different exception than the one in the issue you've
>>> linked, so if it is a problem then it is definitely a different one. Lastly,
>>> I would recommend upgrading to the newest version, as we have rewritten
>>> the SharedBuffer implementation in 1.6.0.
>>>
>>> Best,
>>>
>>> Dawid
>>>
>>> On 26/09/18 13:50, Shailesh Jain wrote:
>>>
>>> Hi,
>>>
>>> I think I've hit this same issue on a 3-node standalone cluster (1.4.2)
>>> using HDFS (2.8.4) as the state backend.
>>>
>>> 2018-09-26 17:07:39,370 INFO
>>> org.apache.flink.runtime.taskmanager.Task                     - Attempting
>>> to fail task externally SelectCepOperator (1/1)
>>> (3bec4aa1ef2226c4e0c5ff7b3860d340).
>>> 2018-09-26 17:07:39,370 INFO
>>> org.apache.flink.runtime.taskmanager.Task                     -
>>> SelectCepOperator (1/1) (3bec4aa1ef2226c4e0c5ff7b3860d340) switched from
>>> RUNNING to FAILED.
>>> AsynchronousException{java.lang.Exception: Could not materialize
>>> checkpoint 6 for operator SelectCepOperator (1/1).}
>>>     at
>>> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:948)
>>>     at
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>     at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>     at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>     at java.lang.Thread.run(Thread.java:748)
>>> Caused by: java.lang.Exception: Could not materialize checkpoint 6 for
>>> operator SelectCepOperator (1/1).
>>>     ... 6 more
>>> Caused by: java.util.concurrent.ExecutionException:
>>> java.lang.NullPointerException
>>>     at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>>     at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>>     at
>>> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
>>>     at
>>> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:894)
>>>     ... 5 more
>>>     Suppressed: java.lang.Exception: Could not properly cancel managed
>>> keyed state future.
>>>         at
>>> org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:91)
>>>         at
>>> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.cleanup(StreamTask.java:976)
>>>         at
>>> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:939)
>>>         ... 5 more
>>>     Caused by: java.util.concurrent.ExecutionException:
>>> java.lang.NullPointerException
>>>         at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>>         at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>>         at
>>> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
>>>         at
>>> org.apache.flink.runtime.state.StateUtil.discardStateFuture(StateUtil.java:66)
>>>         at
>>> org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:89)
>>>         ... 7 more
>>>     Caused by: java.lang.NullPointerException
>>>         at
>>> org.apache.flink.cep.nfa.SharedBuffer$SharedBufferSerializer.serialize(SharedBuffer.java:954)
>>>         at
>>> org.apache.flink.cep.nfa.SharedBuffer$SharedBufferSerializer.serialize(SharedBuffer.java:825)
>>>         at
>>> org.apache.flink.cep.nfa.NFA$NFASerializer.serialize(NFA.java:888)
>>>         at
>>> org.apache.flink.cep.nfa.NFA$NFASerializer.serialize(NFA.java:820)
>>>         at
>>> org.apache.flink.runtime.state.heap.CopyOnWriteStateTableSnapshot.writeMappingsInKeyGroup(CopyOnWriteStateTableSnapshot.java:196)
>>>         at
>>> org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$1.performOperation(HeapKeyedStateBackend.java:390)
>>>         at
>>> org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$1.performOperation(HeapKeyedStateBackend.java:339)
>>>         at
>>> org.apache.flink.runtime.io.async.AbstractAsyncCallableWithResources.call(AbstractAsyncCallableWithResources.java:75)
>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>         at
>>> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:40)
>>>         at
>>> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:894)
>>>         ... 5 more
>>>     [CIRCULAR REFERENCE:java.lang.NullPointerException]
>>>
>>> Any ideas on why I'm hitting this, especially when this (
>>> https://issues.apache.org/jira/browse/FLINK-7756) says it has been
>>> fixed in 1.4.2?
>>>
>>> On Sat, Nov 4, 2017 at 12:34 AM Federico D'Ambrosio <
>>> federico.dambro...@smartlab.ws> wrote:
>>>
>>>> Thank you very much for your speedy response, Kostas!
>>>>
>>>> Cheers,
>>>> Federico
>>>>
>>>> 2017-11-03 16:26 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com>
>>>> :
>>>>
>>>>> Hi Federico,
>>>>>
>>>>> Thanks for trying it out!
>>>>> Great to hear that your problem was fixed!
>>>>>
>>>>> The feature freeze for the release is going to be next week, and I
>>>>> would expect 1 or 2 more weeks of testing.
>>>>> So I would say in 2.5 weeks. But this is of course subject to
>>>>> potential issues we may find during testing.
>>>>>
>>>>> Cheers,
>>>>> Kostas
>>>>>
>>>>> On Nov 3, 2017, at 4:22 PM, Federico D'Ambrosio <
>>>>> federico.dambro...@smartlab.ws> wrote:
>>>>>
>>>>> Hi Kostas,
>>>>>
>>>>> I just tried running the same job with 1.4-SNAPSHOT for 10 minutes and
>>>>> it didn't crash, so it was the same underlying issue as in the JIRA you
>>>>> linked.
>>>>>
>>>>> Do you happen to know when the 1.4 stable release is expected?
>>>>>
>>>>> Thank you very much,
>>>>> Federico
>>>>>
>>>>> 2017-11-03 15:25 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com
>>>>> >:
>>>>>
>>>>>> Perfect! thanks a lot!
>>>>>>
>>>>>> Kostas
>>>>>>
>>>>>> On Nov 3, 2017, at 3:23 PM, Federico D'Ambrosio <
>>>>>> federico.dambro...@smartlab.ws> wrote:
>>>>>>
>>>>>> Hi Kostas,
>>>>>>
>>>>>> yes, I'm using 1.3.2. I'll try the current master and I'll get back
>>>>>> to you.
>>>>>>
>>>>>> 2017-11-03 15:21 GMT+01:00 Kostas Kloudas <
>>>>>> k.klou...@data-artisans.com>:
>>>>>>
>>>>>>> Hi Federico,
>>>>>>>
>>>>>>> I assume that you are using Flink 1.3, right?
>>>>>>>
>>>>>>> In this case, in 1.4 we have fixed a bug that seems similar to your
>>>>>>> case:
>>>>>>> https://issues.apache.org/jira/browse/FLINK-7756
>>>>>>>
>>>>>>> Could you try the current master to see if it fixes your problem?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kostas
>>>>>>>
>>>>>>> On Nov 3, 2017, at 3:12 PM, Federico D'Ambrosio <
>>>>>>> federico.dambro...@smartlab.ws> wrote:
>>>>>>>
>>>>>>>  Could not find id for
>>>>>>> entry:
>>>>>>
>>>>>> --
>>>>>> Federico D'Ambrosio
