Could it be that another process deleted the in-progress checkpoint file?
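
One crude way to check, assuming you can run a small tool on a machine
that has the EFS mount (the job directory below is copied from the stack
trace in this thread; adjust it for your setup), is to poll the
checkpoint directory and log whenever a chk- directory disappears.
A rough sketch, not a polished tool:

    import java.io.File;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class CheckpointDirWatcher {
        public static void main(String[] args) throws InterruptedException {
            // Job checkpoint directory, taken from the stack trace below.
            File dir = new File("/mnt/checkpoints/a300d1b0fd059f3f83ce35a8042e89c8");
            Set<String> previous = new HashSet<>();
            while (true) {
                String[] entries = dir.list((d, name) -> name.startsWith("chk-"));
                Set<String> current = entries == null
                        ? new HashSet<>()
                        : new HashSet<>(Arrays.asList(entries));
                for (String name : previous) {
                    if (!current.contains(name)) {
                        System.out.println(System.currentTimeMillis()
                                + " chk directory removed: " + name);
                    }
                }
                previous = current;
                // Plain polling on purpose: inotify-based watching does not
                // see changes made by other NFS/EFS clients.
                Thread.sleep(1000);
            }
        }
    }

Note that Flink itself removes old chk- directories once checkpoints are
subsumed, so removals as such are expected; the suspicious case is a
directory vanishing while its checkpoint is still in progress.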

Cheers,
Till

On Mon, Mar 8, 2021 at 4:31 PM Yun Gao <yungao...@aliyun.com> wrote:

> Hi Navneeth,
>
> Is the attached exception the root cause of the checkpoint failure?
> Namely, is it also reported in the JobManager log?
>
> Also, have you enabled concurrent checkpoints?
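>
> If so, it might be worth limiting them while debugging. A minimal sketch,
> assuming the standard DataStream API (the interval values here are only
> examples, not a recommendation):
>
>     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>
>     StreamExecutionEnvironment env =
>         StreamExecutionEnvironment.getExecutionEnvironment();
>     env.enableCheckpointing(60_000); // trigger a checkpoint every 60 s
>     // With more than one concurrent checkpoint, several in-progress
>     // chk- directories can exist at the same time.
>     env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
>     env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);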
>
> Best,
>  Yun
>
>
> ------------------ Original Mail ------------------
> *Sender:* Navneeth Krishnan <reachnavnee...@gmail.com>
> *Send Date:* Mon Mar 8 13:10:46 2021
> *Recipients:* Yun Gao <yungao...@aliyun.com>
> *CC:* user <user@flink.apache.org>
> *Subject:* Re: Re: Checkpoint Error
>
>> Hi Yun,
>>
>> Thanks for the response. I checked the mounts, and only the JMs and TMs
>> have this EFS mounted. I'm not sure how to debug this further.
>>
>> Thanks
>>
>> On Sun, Mar 7, 2021 at 8:29 PM Yun Gao <yungao...@aliyun.com> wrote:
>>
>>> Hi Navneeth,
>>>
>>> From the stack trace, it seems the exception is caused by an underlying
>>> EFS problem. Have you checked whether EFS reports any errors, or whether
>>> the same EFS might be mounted elsewhere and another process has deleted
>>> the directory?
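>>>
>>> If it helps, here is a quick Linux-only sketch to list the NFS mounts a
>>> process sees (EFS shows up as nfs4); purely illustrative:
>>>
>>>     import java.io.IOException;
>>>     import java.nio.file.Files;
>>>     import java.nio.file.Paths;
>>>
>>>     public class ListNfsMounts {
>>>         public static void main(String[] args) throws IOException {
>>>             // /proc/self/mounts lists every mount visible to this process.
>>>             Files.lines(Paths.get("/proc/self/mounts"))
>>>                  .filter(line -> line.contains("nfs"))
>>>                  .forEach(System.out::println);
>>>         }
>>>     }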
>>>
>>> Best,
>>> Yun
>>>
>>>
>>> ------------------ Original Mail ------------------
>>> *Sender:* Navneeth Krishnan <reachnavnee...@gmail.com>
>>> *Send Date:* Sun Mar 7 15:44:59 2021
>>> *Recipients:* user <user@flink.apache.org>
>>> *Subject:* Re: Checkpoint Error
>>>
>>>> Hi All,
>>>>
>>>> Any suggestions?
>>>>
>>>> Thanks
>>>>
>>>> On Mon, Jan 18, 2021 at 7:38 PM Navneeth Krishnan <reachnavnee...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> We are running our streaming job on Flink 1.7.2 and we are noticing
>>>>> the error below. We're not sure what's causing it; any pointers would
>>>>> help. We have 10 TMs checkpointing to AWS EFS.
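>>>>>
>>>>> For reference, we point the state backend at the EFS mount along these
>>>>> lines (simplified from our job setup; the file: URI matches the path in
>>>>> the trace):
>>>>>
>>>>>     import org.apache.flink.runtime.state.filesystem.FsStateBackend;
>>>>>     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>>>>
>>>>>     StreamExecutionEnvironment env =
>>>>>         StreamExecutionEnvironment.getExecutionEnvironment();
>>>>>     // EFS is NFS-mounted at /mnt/checkpoints on every JM and TM, so
>>>>>     // checkpoints go through Flink's local file system connector
>>>>>     // (hence LocalDataOutputStream in the trace below).
>>>>>     env.setStateBackend(new FsStateBackend("file:///mnt/checkpoints"));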
>>>>>
>>>>> AsynchronousException{java.lang.Exception: Could not materialize checkpoint 11 for operator Processor -> Sink: KafkaSink (34/42).}
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1153)
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:947)
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:884)
>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>     at java.lang.Thread.run(Thread.java:748)
>>>>> Caused by: java.lang.Exception: Could not materialize checkpoint 11 for operator Processor -> Sink: KafkaSink (34/42).
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:942)
>>>>>     ... 6 more
>>>>> Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not flush and close the file system output stream to file:/mnt/checkpoints/a300d1b0fd059f3f83ce35a8042e89c8/chk-11/1cd768bd-3408-48a9-ad48-b005f66b130d in order to obtain the stream state handle
>>>>>     at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>>>>     at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>>>>     at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:53)
>>>>>     at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:53)
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:853)
>>>>>     ... 5 more
>>>>> Caused by: java.io.IOException: Could not flush and close the file system output stream to file:/mnt/checkpoints/a300d1b0fd059f3f83ce35a8042e89c8/chk-11/1cd768bd-3408-48a9-ad48-b005f66b130d in order to obtain the stream state handle
>>>>>     at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:326)
>>>>>     at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:767)
>>>>>     at org.apache.flink.runtime.state.DefaultOperatorStateBackend$DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackend.java:696)
>>>>>     at org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:76)
>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>     at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)
>>>>>     ... 7 more
>>>>> Caused by: java.io.IOException: Stale file handle
>>>>>     at java.io.FileOutputStream.close0(Native Method)
>>>>>     at java.io.FileOutputStream.access$000(FileOutputStream.java:53)
>>>>>     at java.io.FileOutputStream$1.close(FileOutputStream.java:356)
>>>>>     at java.io.FileDescriptor.closeAll(FileDescriptor.java:212)
>>>>>     at java.io.FileOutputStream.close(FileOutputStream.java:354)
>>>>>     at org.apache.flink.core.fs.local.LocalDataOutputStream.close(LocalDataOutputStream.java:62)
>>>>>     at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)
>>>>>     at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:312)
>>>>>     ... 12 more
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>>
