Hi Navneeth, Is the attached exception the root cause for the checkpoint failure ? Namely is it also reported in job manager log?
Also, have you enabled concurrent checkpoint? Best, Yun ------------------Original Mail ------------------ Sender:Navneeth Krishnan <reachnavnee...@gmail.com> Send Date:Mon Mar 8 13:10:46 2021 Recipients:Yun Gao <yungao...@aliyun.com> CC:user <user@flink.apache.org> Subject:Re: Re: Checkpoint Error Hi Yun, Thanks for the response. I checked the mounts and only the JM's and TM's are mounted with this EFS. Not sure how to debug this. Thanks On Sun, Mar 7, 2021 at 8:29 PM Yun Gao <yungao...@aliyun.com> wrote: Hi Navneeth, It seems from the stack that the exception is caused by the underlying EFS problems ? Have you checked if there are errors reported for EFS, or if there might be duplicate mounting for the same EFS and others have ever deleted the directory? Best, Yun ------------------Original Mail ------------------ Sender:Navneeth Krishnan <reachnavnee...@gmail.com> Send Date:Sun Mar 7 15:44:59 2021 Recipients:user <user@flink.apache.org> Subject:Re: Checkpoint Error Hi All, Any suggestions? Thanks On Mon, Jan 18, 2021 at 7:38 PM Navneeth Krishnan <reachnavnee...@gmail.com> wrote: Hi All, We are running our streaming job on flink 1.7.2 and we are noticing the below error. Not sure what's causing it, any pointers would help. We have 10 TM's checkpointing to AWS EFS. AsynchronousException{java.lang.Exception: Could not materialize checkpoint 11 for operator Processor -> Sink: KafkaSink (34/42).}at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1153)at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:947)at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:884)at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)at java.util.concurrent.FutureTask.run(FutureTask.java:266)at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)at java.lang.Thread.run(Thread.java:748)Caused by: java.lang.Exception: Could not materialize checkpoint 11 for operator Processor -> Sink: KafkaSink (34/42).at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:942)... 6 moreCaused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not flush and close the file system output stream to file:/mnt/checkpoints/a300d1b0fd059f3f83ce35a8042e89c8/chk-11/1cd768bd-3408-48a9-ad48-b005f66b130d in order to obtain the stream state handleat java.util.concurrent.FutureTask.report(FutureTask.java:122)at java.util.concurrent.FutureTask.get(FutureTask.java:192)at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:53)at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:53)at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:853)... 5 moreCaused by: java.io.IOException: Could not flush and close the file system output stream to file:/mnt/checkpoints/a300d1b0fd059f3f83ce35a8042e89c8/chk-11/1cd768bd-3408-48a9-ad48-b005f66b130d in order to obtain the stream state handleat org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:326)at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:767)at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:696)at org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:76)at java.util.concurrent.FutureTask.run(FutureTask.java:266)at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)... 7 moreCaused by: java.io.IOException: Stale file handleat java.io.FileOutputStream.close0(Native Method)at java.io.FileOutputStream.access$000(FileOutputStream.java:53)at java.io.FileOutputStream$1.close(FileOutputStream.java:356)at java.io.FileDescriptor.closeAll(FileDescriptor.java:212)at java.io.FileOutputStream.close(FileOutputStream.java:354)at org.apache.flink.core.fs.local.LocalDataOutputStream.close(LocalDataOutputStream.java:62)at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:312)... 12 more Thanks