Re: FlinkCEP, circular references and checkpointing failures

Shailesh Jain Thu, 27 Sep 2018 22:31:27 -0700

Hi Dawid,

Thanks for your time on this. The diff should have pointed out only the top
3 commits, but since it did not, it is possible I did not rebase my branch
against 1.4.2 correctly. I'll check this out and get back to you if I hit
the same issue again.


Thanks again,
Shailesh

On Thu, Sep 27, 2018 at 1:00 PM Dawid Wysakowicz <dwysakow...@apache.org>
wrote:

> Hi Shailesh,
>
> I am afraid it is going to be hard to help you, as this branch differs
> significantly from the 1.4.2 release (I diffed your branch against
> tag/release-1.4.2). Moreover, the code in the branch you've provided still
> does not correspond to the lines in the exception you posted previously.
> Could you check whether the problem occurs on vanilla Flink as well?
>
> Best,
>
> Dawid
>
> On 27/09/18 08:22, Shailesh Jain wrote:
>
> Hi Dawid,
>
> Yes, it is version 1.4.2. We are running vanilla Flink, but have added a
> couple of changes specifically in the CEP operator (top 3 commits here:
> https://github.com/jainshailesh/flink/commits/poc_on_1.4.2). The changes I've
> made to the CEP operators do not touch the checkpointing path; they only
> overload the operator to handle event time in a specific way.
>
> We are hitting this in production, so I'm not sure it'll be feasible to
> move to 1.6.0 immediately, but eventually yes.
>
> Thanks,
> Shailesh
>
> On Wed, Sep 26, 2018 at 5:44 PM Dawid Wysakowicz <dwysakow...@apache.org>
> wrote:
>
>> Hi Shailesh,
>>
>> Are you sure you are using version 1.4.2? Do you run vanilla Flink, or
>> have you introduced some changes? I am asking because the lines in the
>> stack trace do not align with the source code for 1.4.2.
>>
>> Also, it is a different exception than the one in the issue you've linked,
>> so if it is a problem, then it is definitely a different one. Lastly, I
>> would recommend upgrading to the newest version, as we have rewritten the
>> SharedBuffer implementation in 1.6.0.
>>
>> Best,
>>
>> Dawid
>>
>> On 26/09/18 13:50, Shailesh Jain wrote:
>>
>> Hi,
>>
>> I think I've hit this same issue on a 3-node standalone cluster (1.4.2)
>> using HDFS (2.8.4) as the state backend.
>>
>> 2018-09-26 17:07:39,370 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally SelectCepOperator (1/1) (3bec4aa1ef2226c4e0c5ff7b3860d340).
>> 2018-09-26 17:07:39,370 INFO org.apache.flink.runtime.taskmanager.Task - SelectCepOperator (1/1) (3bec4aa1ef2226c4e0c5ff7b3860d340) switched from RUNNING to FAILED.
>> AsynchronousException{java.lang.Exception: Could not materialize checkpoint 6 for operator SelectCepOperator (1/1).}
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:948)
>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>     at java.lang.Thread.run(Thread.java:748)
>> Caused by: java.lang.Exception: Could not materialize checkpoint 6 for operator SelectCepOperator (1/1).
>>     ... 6 more
>> Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException
>>     at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>     at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>     at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:894)
>>     ... 5 more
>>     Suppressed: java.lang.Exception: Could not properly cancel managed keyed state future.
>>         at org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:91)
>>         at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.cleanup(StreamTask.java:976)
>>         at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:939)
>>         ... 5 more
>>     Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException
>>         at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>         at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>         at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
>>         at org.apache.flink.runtime.state.StateUtil.discardStateFuture(StateUtil.java:66)
>>         at org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:89)
>>         ... 7 more
>>     Caused by: java.lang.NullPointerException
>>         at org.apache.flink.cep.nfa.SharedBuffer$SharedBufferSerializer.serialize(SharedBuffer.java:954)
>>         at org.apache.flink.cep.nfa.SharedBuffer$SharedBufferSerializer.serialize(SharedBuffer.java:825)
>>         at org.apache.flink.cep.nfa.NFA$NFASerializer.serialize(NFA.java:888)
>>         at org.apache.flink.cep.nfa.NFA$NFASerializer.serialize(NFA.java:820)
>>         at org.apache.flink.runtime.state.heap.CopyOnWriteStateTableSnapshot.writeMappingsInKeyGroup(CopyOnWriteStateTableSnapshot.java:196)
>>         at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$1.performOperation(HeapKeyedStateBackend.java:390)
>>         at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$1.performOperation(HeapKeyedStateBackend.java:339)
>>         at org.apache.flink.runtime.io.async.AbstractAsyncCallableWithResources.call(AbstractAsyncCallableWithResources.java:75)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>         at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:40)
>>         at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:894)
>>         ... 5 more
>>     [CIRCULAR REFERENCE:java.lang.NullPointerException]
>>
>> Any ideas on why I'm hitting this, especially when this issue (
>> https://issues.apache.org/jira/browse/FLINK-7756) says it has been fixed
>> in 1.4.2?
>>
>> On Sat, Nov 4, 2017 at 12:34 AM Federico D'Ambrosio <
>> federico.dambro...@smartlab.ws> wrote:
>>
>>> Thank you very much for your steady response, Kostas!
>>>
>>> Cheers,
>>> Federico
>>>
>>> 2017-11-03 16:26 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com>:
>>>
>>>> Hi Federico,
>>>>
>>>> Thanks for trying it out!
>>>> Great to hear that your problem was fixed!
>>>>
>>>> The feature freeze for the release is going to be next week, and I
>>>> would expect 1 or 2 more weeks of testing.
>>>> So I would say in 2.5 weeks. But this is of course subject to potential
>>>> issues we may find during testing.
>>>>
>>>> Cheers,
>>>> Kostas
>>>>
>>>> On Nov 3, 2017, at 4:22 PM, Federico D'Ambrosio <
>>>> federico.dambro...@smartlab.ws> wrote:
>>>>
>>>> Hi Kostas,
>>>>
>>>> I just tried running the same job with 1.4-SNAPSHOT for 10 minutes and
>>>> it didn't crash, so it was the same underlying issue as the JIRA you
>>>> linked.
>>>>
>>>> Do you happen to know when the 1.4 stable release is expected?
>>>>
>>>> Thank you very much,
>>>> Federico
>>>>
>>>> 2017-11-03 15:25 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com>:
>>>>
>>>>> Perfect! thanks a lot!
>>>>>
>>>>> Kostas
>>>>>
>>>>> On Nov 3, 2017, at 3:23 PM, Federico D'Ambrosio <
>>>>> federico.dambro...@smartlab.ws> wrote:
>>>>>
>>>>> Hi Kostas,
>>>>>
>>>>> yes, I'm using 1.3.2. I'll try the current master and I'll get back to
>>>>> you.
>>>>>
>>>>> 2017-11-03 15:21 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com>:
>>>>>
>>>>>> Hi Federico,
>>>>>>
>>>>>> I assume that you are using Flink 1.3, right?
>>>>>>
>>>>>> In this case, in 1.4 we have fixed a bug that seems similar to your
>>>>>> case:
>>>>>> https://issues.apache.org/jira/browse/FLINK-7756
>>>>>>
>>>>>> Could you try the current master to see if it fixes your problem?
>>>>>>
>>>>>> Thanks,
>>>>>> Kostas
>>>>>>
>>>>>> On Nov 3, 2017, at 3:12 PM, Federico D'Ambrosio <
>>>>>> federico.dambro...@smartlab.ws> wrote:
>>>>>>
>>>>>>  Could not find id for
>>>>>> entry:
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Federico D'Ambrosio
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Federico D'Ambrosio
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Federico D'Ambrosio
>>>
>>
>>
>
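
A note on the `[CIRCULAR REFERENCE:java.lang.NullPointerException]` line at the end of the stack trace quoted in this thread: as far as the JDK is concerned, it is not a separate failure. `Throwable.printStackTrace` prints that marker when it reaches a throwable it has already printed earlier in the same trace, which happens here because the same NullPointerException is reachable both through the suppressed cleanup exception and through the primary cause chain. The following is a minimal sketch in plain Java that reproduces the shape of that output. It does not use Flink; the class name `CircularReferenceDemo` and the way the exceptions are wired together are assumptions made for illustration, with messages borrowed from the trace.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.concurrent.ExecutionException;

public class CircularReferenceDemo {

    /** Builds an exception graph shaped like the one in the trace and renders it. */
    static String render() {
        // Stand-in for the NPE thrown inside the serializer (hypothetical wiring).
        NullPointerException root = new NullPointerException();

        // Primary failure: the async snapshot future failed with the NPE.
        ExecutionException failed = new ExecutionException(root);

        // Cleanup failure: cancelling the same future surfaces the same NPE instance.
        Exception cleanup = new Exception(
                "Could not properly cancel managed keyed state future.",
                new ExecutionException(root));
        failed.addSuppressed(cleanup);

        Exception top = new Exception("Could not materialize checkpoint", failed);

        // printStackTrace prints suppressed exceptions before the cause, so the NPE
        // is printed in full under "Suppressed:" and only referenced afterwards.
        StringWriter out = new StringWriter();
        top.printStackTrace(new PrintWriter(out, true));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render());
    }
}
```

Running `main` prints a trace in which the NullPointerException appears once in full under the `Suppressed:` entry and once as a circular-reference marker. The exact spacing of the marker differs between JDK versions, so the NullPointerException at `SharedBuffer.java:954` remains the single root cause to investigate.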
