Re: FlinkCEP, circular references and checkpointing failures

Dawid Wysakowicz Wed, 26 Sep 2018 05:15:01 -0700

Hi Shailesh,

Are you sure you are using version 1.4.2? Do you run a vanilla flink, or
have you introduced some changes? I am asking cause the lines in
stacktrace does not align with the source code for 1.4.2.


Also it is a different exception than the one in the issue you've
linked, so if it is a problem than it is definitely a different one.
Last thing I would recommend upgrading to the newest version, as we
rewritten the SharedBuffer implementation in 1.6.0.

Best,

Dawid


On 26/09/18 13:50, Shailesh Jain wrote:
> Hi,
>
> I think I've hit this same issue on a 3 node standalone cluster
> (1.4.2) using HDFS (2.8.4) as state backend.
>
> 2018-09-26 17:07:39,370 INFO 
> org.apache.flink.runtime.taskmanager.Task                     -
> Attempting to fail task externally SelectCepOperator (1/1)
> (3bec4aa1ef2226c4e0c5ff7b3860d340).
> 2018-09-26 17:07:39,370 INFO 
> org.apache.flink.runtime.taskmanager.Task                     -
> SelectCepOperator (1/1) (3bec4aa1ef2226c4e0c5ff7b3860d340) switched
> from RUNNING to FAILED.
> AsynchronousException{java.lang.Exception: Could not materialize
> checkpoint 6 for operator SelectCepOperator (1/1).}
>     at
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:948)
>     at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.Exception: Could not materialize checkpoint 6 for
> operator SelectCepOperator (1/1).
>     ... 6 more
> Caused by: java.util.concurrent.ExecutionException:
> java.lang.NullPointerException
>     at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>     at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>     at
> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
>     at
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:894)
>     ... 5 more
>     Suppressed: java.lang.Exception: Could not properly cancel managed
> keyed state future.
>         at
> org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:91)
>         at
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.cleanup(StreamTask.java:976)
>         at
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:939)
>         ... 5 more
>     Caused by: java.util.concurrent.ExecutionException:
> java.lang.NullPointerException
>         at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>         at
> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
>         at
> org.apache.flink.runtime.state.StateUtil.discardStateFuture(StateUtil.java:66)
>         at
> org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:89)
>         ... 7 more
>     Caused by: java.lang.NullPointerException
>         at
> org.apache.flink.cep.nfa.SharedBuffer$SharedBufferSerializer.serialize(SharedBuffer.java:954)
>         at
> org.apache.flink.cep.nfa.SharedBuffer$SharedBufferSerializer.serialize(SharedBuffer.java:825)
>         at
> org.apache.flink.cep.nfa.NFA$NFASerializer.serialize(NFA.java:888)
>         at
> org.apache.flink.cep.nfa.NFA$NFASerializer.serialize(NFA.java:820)
>         at
> org.apache.flink.runtime.state.heap.CopyOnWriteStateTableSnapshot.writeMappingsInKeyGroup(CopyOnWriteStateTableSnapshot.java:196)
>         at
> org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$1.performOperation(HeapKeyedStateBackend.java:390)
>         at
> org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$1.performOperation(HeapKeyedStateBackend.java:339)
>         at
> org.apache.flink.runtime.io.async.AbstractAsyncCallableWithResources.call(AbstractAsyncCallableWithResources.java:75)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at
> org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:40)
>         at
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:894)
>         ... 5 more
>     [CIRCULAR REFERENCE:java.lang.NullPointerException]
>
> Any ideas on why I'm hitting this especially when this
> (https://issues.apache.org/jira/browse/FLINK-7756) says it has been
> fixed in 1.4.2 ?
>
> On Sat, Nov 4, 2017 at 12:34 AM Federico D'Ambrosio
> <federico.dambro...@smartlab.ws
> <mailto:federico.dambro...@smartlab.ws>> wrote:
>
>     Thank you very much for your steady response, Kostas!
>
>     Cheers,
>     Federico
>
>     2017-11-03 16:26 GMT+01:00 Kostas Kloudas
>     <k.klou...@data-artisans.com <mailto:k.klou...@data-artisans.com>>:
>
>         Hi Federico,
>
>         Thanks for trying it out! 
>         Great to hear that your problem was fixed!
>
>         The feature freeze for the release is going to be next week,
>         and I would expect 1 or 2 more weeks testing.
>         So I would say in 2.5 weeks. But this is of course subject to
>         potential issues we may find during testing.
>
>         Cheers,
>         Kostas
>
>>         On Nov 3, 2017, at 4:22 PM, Federico D'Ambrosio
>>         <federico.dambro...@smartlab.ws
>>         <mailto:federico.dambro...@smartlab.ws>> wrote:
>>
>>         Hi Kostas,
>>
>>         I just tried running the same job with 1.4-SNAPSHOT for 10
>>         minutes and it didn't crash, so that was the same underlying
>>         issue of the JIRA you linked.
>>
>>         Do you happen to know when it's expected the 1.4 stable release?
>>
>>         Thank you very much,
>>         Federico
>>
>>         2017-11-03 15:25 GMT+01:00 Kostas Kloudas
>>         <k.klou...@data-artisans.com
>>         <mailto:k.klou...@data-artisans.com>>:
>>
>>             Perfect! thanks a lot!
>>
>>             Kostas
>>
>>>             On Nov 3, 2017, at 3:23 PM, Federico D'Ambrosio
>>>             <federico.dambro...@smartlab.ws
>>>             <mailto:federico.dambro...@smartlab.ws>> wrote:
>>>
>>>             Hi Kostas,
>>>
>>>             yes, I'm using 1.3.2. I'll try the current master and
>>>             I'll get back to you.
>>>
>>>             2017-11-03 15:21 GMT+01:00 Kostas Kloudas
>>>             <k.klou...@data-artisans.com
>>>             <mailto:k.klou...@data-artisans.com>>:
>>>
>>>                 Hi Federico,
>>>
>>>                 I assume that you are using Flink 1.3, right?
>>>
>>>                 In this case, in 1.4 we have fixed a bug that seems
>>>                 similar to your case:
>>>                 https://issues.apache.org/jira/browse/FLINK-7756
>>>
>>>                 Could you try the current master to see if it fixes
>>>                 your problem?
>>>
>>>                 Thanks,
>>>                 Kostas
>>>
>>>>                 On Nov 3, 2017, at 3:12 PM, Federico D'Ambrosio
>>>>                 <federico.dambro...@smartlab.ws
>>>>                 <mailto:federico.dambro...@smartlab.ws>> wrote:
>>>>
>>>>                  Could not find id for
>>>>                 entry:                                                     
>>>>    
>>>
>>>
>>>
>>>
>>>             -- 
>>>             Federico D'Ambrosio
>>
>>
>>
>>
>>         -- 
>>         Federico D'Ambrosio
>
>
>
>
>     -- 
>     Federico D'Ambrosio
>

signature.asc
Description: OpenPGP digital signature

Re: FlinkCEP, circular references and checkpointing failures

Reply via email to