Hi, I think I've hit this same issue on a 3-node standalone cluster (1.4.2) using HDFS (2.8.4) as the state backend.

2018-09-26 17:07:39,370 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally SelectCepOperator (1/1) (3bec4aa1ef2226c4e0c5ff7b3860d340).
2018-09-26 17:07:39,370 INFO org.apache.flink.runtime.taskmanager.Task - SelectCepOperator (1/1) (3bec4aa1ef2226c4e0c5ff7b3860d340) switched from RUNNING to FAILED.
AsynchronousException{java.lang.Exception: Could not materialize checkpoint 6 for operator SelectCepOperator (1/1).}
    at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:948)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Could not materialize checkpoint 6 for operator SelectCepOperator (1/1).
    ... 6 more
Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
    at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:894)
    ... 5 more
    Suppressed: java.lang.Exception: Could not properly cancel managed keyed state future.
        at org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:91)
        at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.cleanup(StreamTask.java:976)
        at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:939)
        ... 5 more
    Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
        at org.apache.flink.runtime.state.StateUtil.discardStateFuture(StateUtil.java:66)
        at org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:89)
        ... 7 more
    Caused by: java.lang.NullPointerException
        at org.apache.flink.cep.nfa.SharedBuffer$SharedBufferSerializer.serialize(SharedBuffer.java:954)
        at org.apache.flink.cep.nfa.SharedBuffer$SharedBufferSerializer.serialize(SharedBuffer.java:825)
        at org.apache.flink.cep.nfa.NFA$NFASerializer.serialize(NFA.java:888)
        at org.apache.flink.cep.nfa.NFA$NFASerializer.serialize(NFA.java:820)
        at org.apache.flink.runtime.state.heap.CopyOnWriteStateTableSnapshot.writeMappingsInKeyGroup(CopyOnWriteStateTableSnapshot.java:196)
        at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$1.performOperation(HeapKeyedStateBackend.java:390)
        at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$1.performOperation(HeapKeyedStateBackend.java:339)
        at org.apache.flink.runtime.io.async.AbstractAsyncCallableWithResources.call(AbstractAsyncCallableWithResources.java:75)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:40)
        at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:894)
        ... 5 more
    [CIRCULAR REFERENCE:java.lang.NullPointerException]

Any ideas on why I'm hitting this, especially when this (https://issues.apache.org/jira/browse/FLINK-7756) says it has been fixed in 1.4.2?

On Sat, Nov 4, 2017 at 12:34 AM Federico D'Ambrosio <federico.dambro...@smartlab.ws> wrote:

> Thank you very much for your steady response, Kostas!
>
> Cheers,
> Federico
>
> 2017-11-03 16:26 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com>:
>
>> Hi Federico,
>>
>> Thanks for trying it out!
>> Great to hear that your problem was fixed!
>>
>> The feature freeze for the release is going to be next week, and I would
>> expect 1 or 2 more weeks testing.
>> So I would say in 2.5 weeks. But this is of course subject to potential
>> issues we may find during testing.
>>
>> Cheers,
>> Kostas
>>
>> On Nov 3, 2017, at 4:22 PM, Federico D'Ambrosio <federico.dambro...@smartlab.ws> wrote:
>>
>> Hi Kostas,
>>
>> I just tried running the same job with 1.4-SNAPSHOT for 10 minutes and it
>> didn't crash, so that was the same underlying issue of the JIRA you linked.
>>
>> Do you happen to know when it's expected the 1.4 stable release?
>>
>> Thank you very much,
>> Federico
>>
>> 2017-11-03 15:25 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com>:
>>
>>> Perfect! thanks a lot!
>>>
>>> Kostas
>>>
>>> On Nov 3, 2017, at 3:23 PM, Federico D'Ambrosio <federico.dambro...@smartlab.ws> wrote:
>>>
>>> Hi Kostas,
>>>
>>> yes, I'm using 1.3.2. I'll try the current master and I'll get back to
>>> you.
>>>
>>> 2017-11-03 15:21 GMT+01:00 Kostas Kloudas <k.klou...@data-artisans.com>:
>>>
>>>> Hi Federico,
>>>>
>>>> I assume that you are using Flink 1.3, right?
>>>>
>>>> In this case, in 1.4 we have fixed a bug that seems similar to your
>>>> case:
>>>> https://issues.apache.org/jira/browse/FLINK-7756
>>>>
>>>> Could you try the current master to see if it fixes your problem?
>>>>
>>>> Thanks,
>>>> Kostas
>>>>
>>>> On Nov 3, 2017, at 3:12 PM, Federico D'Ambrosio <federico.dambro...@smartlab.ws> wrote:
>>>>
>>>> Could not find id for entry:
>>>>
>>>
>>> --
>>> Federico D'Ambrosio
>>>
>>
>> --
>> Federico D'Ambrosio
>>
>
> --
> Federico D'Ambrosio