[ https://issues.apache.org/jira/browse/FLINK-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15516309#comment-15516309 ]
Stephan Ewen commented on FLINK-4218: ------------------------------------- >From the Amazon docs, the issue that we run into is this: ??Amazon S3 achieves high availability by replicating data across multiple servers within Amazon's data centers. If a PUT request is successful, your data is safely stored. However, information about the changes must replicate across Amazon S3, which can take some time, and so you might observe the following behaviors:?? ??A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list.?? So, let's simply make the following changes. - The access to the file size does not fail the checkpoint. If it results in a "file not found", the checkpoint will report "unknown size". - The S3 (and S3A) filesystems have to do a retry on FileStatus operations. The second part is not even necessary, I think, simply makes it nicer to have fewer "unknown" sizes. > Sporadic "java.lang.RuntimeException: Error triggering a checkpoint..." > causes task restarting > ---------------------------------------------------------------------------------------------- > > Key: FLINK-4218 > URL: https://issues.apache.org/jira/browse/FLINK-4218 > Project: Flink > Issue Type: Improvement > Affects Versions: 1.1.0 > Reporter: Sergii Koshel > > Sporadically see exception as below. And restart of task because of it. > {code:title=Exception|borderStyle=solid} > java.lang.RuntimeException: Error triggering a checkpoint as the result of > receiving checkpoint barrier > at > org.apache.flink.streaming.runtime.tasks.StreamTask$3.onEvent(StreamTask.java:785) > at > org.apache.flink.streaming.runtime.tasks.StreamTask$3.onEvent(StreamTask.java:775) > at > org.apache.flink.streaming.runtime.io.BarrierBuffer.processBarrier(BarrierBuffer.java:203) > at > org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:129) > at > org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:183) > at > org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:66) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:265) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:588) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.FileNotFoundException: No such file or directory: > s3://<bucket_name_here>/flink/checkpoints/ece317c26960464ba5de75f3bbc38cb2/chk-8810/96eebbeb-de14-45c7-8ebb-e7cde978d6d3 > at > org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:996) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77) > at > org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.getFileStatus(HadoopFileSystem.java:351) > at > org.apache.flink.runtime.state.filesystem.AbstractFileStateHandle.getFileSize(AbstractFileStateHandle.java:93) > at > org.apache.flink.runtime.state.filesystem.FileStreamStateHandle.getStateSize(FileStreamStateHandle.java:58) > at > org.apache.flink.runtime.state.AbstractStateBackend$DataInputViewHandle.getStateSize(AbstractStateBackend.java:482) > at > org.apache.flink.streaming.runtime.tasks.StreamTaskStateList.getStateSize(StreamTaskStateList.java:77) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:604) > at > org.apache.flink.streaming.runtime.tasks.StreamTask$3.onEvent(StreamTask.java:779) > ... 8 more > {code} > File actually exists on S3. > I suppose it is related to some race conditions with S3 but would be good to > retry a few times before stop task execution. -- This message was sent by Atlassian JIRA (v6.3.4#6332)