Could it be that another process deleted the in-progress checkpoint file?
Cheers,
Till

On Mon, Mar 8, 2021 at 4:31 PM Yun Gao <yungao...@aliyun.com> wrote:

> Hi Navneeth,
>
> Is the attached exception the root cause of the checkpoint failure?
> That is, is it also reported in the job manager log?
>
> Also, have you enabled concurrent checkpoints?
>
> Best,
> Yun
>
>
> ------------------Original Mail ------------------
> *Sender:* Navneeth Krishnan <reachnavnee...@gmail.com>
> *Send Date:* Mon Mar 8 13:10:46 2021
> *Recipients:* Yun Gao <yungao...@aliyun.com>
> *CC:* user <user@flink.apache.org>
> *Subject:* Re: Re: Checkpoint Error
>
>> Hi Yun,
>>
>> Thanks for the response. I checked the mounts, and only the JMs and TMs
>> are mounted with this EFS. Not sure how to debug this further.
>>
>> Thanks
>>
>> On Sun, Mar 7, 2021 at 8:29 PM Yun Gao <yungao...@aliyun.com> wrote:
>>
>>> Hi Navneeth,
>>>
>>> From the stack trace, it seems the exception is caused by an underlying
>>> EFS problem. Have you checked whether there are errors reported for
>>> EFS, or whether the same EFS might be mounted in more than one place
>>> and another consumer has deleted the directory?
>>>
>>> Best,
>>> Yun
>>>
>>>
>>> ------------------Original Mail ------------------
>>> *Sender:* Navneeth Krishnan <reachnavnee...@gmail.com>
>>> *Send Date:* Sun Mar 7 15:44:59 2021
>>> *Recipients:* user <user@flink.apache.org>
>>> *Subject:* Re: Checkpoint Error
>>>
>>>> Hi All,
>>>>
>>>> Any suggestions?
>>>>
>>>> Thanks
>>>>
>>>> On Mon, Jan 18, 2021 at 7:38 PM Navneeth Krishnan <
>>>> reachnavnee...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> We are running our streaming job on Flink 1.7.2 and we are noticing
>>>>> the error below. Not sure what's causing it; any pointers would help.
>>>>> We have 10 TMs checkpointing to AWS EFS.
>>>>>
>>>>> AsynchronousException{java.lang.Exception: Could not materialize checkpoint 11 for operator Processor -> Sink: KafkaSink (34/42).}
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1153)
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:947)
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:884)
>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>     at java.lang.Thread.run(Thread.java:748)
>>>>> Caused by: java.lang.Exception: Could not materialize checkpoint 11 for operator Processor -> Sink: KafkaSink (34/42).
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:942)
>>>>>     ... 6 more
>>>>> Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not flush and close the file system output stream to file:/mnt/checkpoints/a300d1b0fd059f3f83ce35a8042e89c8/chk-11/1cd768bd-3408-48a9-ad48-b005f66b130d in order to obtain the stream state handle
>>>>>     at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>>>>     at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>>>>     at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:53)
>>>>>     at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:53)
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:853)
>>>>>     ... 5 more
>>>>> Caused by: java.io.IOException: Could not flush and close the file system output stream to file:/mnt/checkpoints/a300d1b0fd059f3f83ce35a8042e89c8/chk-11/1cd768bd-3408-48a9-ad48-b005f66b130d in order to obtain the stream state handle
>>>>>     at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:326)
>>>>>     at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:767)
>>>>>     at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:696)
>>>>>     at org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:76)
>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>     at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)
>>>>>     ... 7 more
>>>>> Caused by: java.io.IOException: Stale file handle
>>>>>     at java.io.FileOutputStream.close0(Native Method)
>>>>>     at java.io.FileOutputStream.access$000(FileOutputStream.java:53)
>>>>>     at java.io.FileOutputStream$1.close(FileOutputStream.java:356)
>>>>>     at java.io.FileDescriptor.closeAll(FileDescriptor.java:212)
>>>>>     at java.io.FileOutputStream.close(FileOutputStream.java:354)
>>>>>     at org.apache.flink.core.fs.local.LocalDataOutputStream.close(LocalDataOutputStream.java:62)
>>>>>     at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)
>>>>>     at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:312)
>>>>>     ... 12 more
>>>>>
>>>>>
>>>>> Thanks
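[Editor's note] Regarding Yun's question upthread about concurrent checkpoints: in Flink 1.x this is configured programmatically through the job's CheckpointConfig rather than in flink-conf.yaml. The sketch below shows how a job could cap itself to a single in-flight checkpoint, so that a new chk-N directory is never being written while the previous one is still open; the interval and pause values are illustrative, not taken from the thread.

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSettings {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60 seconds (example value).
        env.enableCheckpointing(60_000);

        CheckpointConfig cfg = env.getCheckpointConfig();
        // At most one checkpoint in flight at a time: a new checkpoint
        // cannot start while a previous chk-N directory is still being
        // written to the checkpoint filesystem (EFS in this thread).
        cfg.setMaxConcurrentCheckpoints(1);
        // Leave some breathing room between checkpoints (example value).
        cfg.setMinPauseBetweenCheckpoints(30_000);
    }
}
```

Note this only rules out the job's own checkpoints overlapping; it does not protect against another process or another mount of the same EFS deleting the in-progress checkpoint directory, which is what the "Stale file handle" in the trace suggests.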