[jira] [Commented] (FLINK-4218) Sporadic "java.lang.RuntimeException: Error triggering a checkpoint..." causes task restarting

Stephan Ewen (JIRA) Fri, 23 Sep 2016 05:28:31 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15516309#comment-15516309
 ]


Stephan Ewen commented on FLINK-4218:
-------------------------------------

>From the Amazon docs, the issue that we run into is this:

??Amazon S3 achieves high availability by replicating data across multiple 
servers within Amazon's data centers. If a PUT request is successful, your data 
is safely stored. However, information about the changes must replicate across 
Amazon S3, which can take some time, and so you might observe the following 
behaviors:??

??A process writes a new object to Amazon S3 and immediately lists keys within 
its bucket. Until the change is fully propagated, the object might not appear 
in the list.??

So, let's simply make the following changes.
  - The access to the file size does not fail the checkpoint. If it results in 
a "file not found", the checkpoint will report "unknown size".
  - The S3 (and S3A) filesystems have to do a retry on FileStatus operations.

The second part is not even necessary, I think, simply makes it nicer to have 
fewer "unknown" sizes.

> Sporadic "java.lang.RuntimeException: Error triggering a checkpoint..." 
> causes task restarting
> ----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-4218
>                 URL: https://issues.apache.org/jira/browse/FLINK-4218
>             Project: Flink
>          Issue Type: Improvement
>    Affects Versions: 1.1.0
>            Reporter: Sergii Koshel
>
> Sporadically see exception as below. And restart of task because of it.
> {code:title=Exception|borderStyle=solid}
> java.lang.RuntimeException: Error triggering a checkpoint as the result of 
> receiving checkpoint barrier
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$3.onEvent(StreamTask.java:785)
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$3.onEvent(StreamTask.java:775)
>       at 
> org.apache.flink.streaming.runtime.io.BarrierBuffer.processBarrier(BarrierBuffer.java:203)
>       at 
> org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:129)
>       at 
> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:183)
>       at 
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:66)
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:265)
>       at org.apache.flink.runtime.taskmanager.Task.run(Task.java:588)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.FileNotFoundException: No such file or directory: 
> s3://<bucket_name_here>/flink/checkpoints/ece317c26960464ba5de75f3bbc38cb2/chk-8810/96eebbeb-de14-45c7-8ebb-e7cde978d6d3
>       at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:996)
>       at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
>       at 
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.getFileStatus(HadoopFileSystem.java:351)
>       at 
> org.apache.flink.runtime.state.filesystem.AbstractFileStateHandle.getFileSize(AbstractFileStateHandle.java:93)
>       at 
> org.apache.flink.runtime.state.filesystem.FileStreamStateHandle.getStateSize(FileStreamStateHandle.java:58)
>       at 
> org.apache.flink.runtime.state.AbstractStateBackend$DataInputViewHandle.getStateSize(AbstractStateBackend.java:482)
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskStateList.getStateSize(StreamTaskStateList.java:77)
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:604)
>       at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$3.onEvent(StreamTask.java:779)
>       ... 8 more
> {code}
> File actually exists on S3. 
> I suppose it is related to some race conditions with S3 but would be good to 
> retry a few times before stop task execution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-4218) Sporadic "java.lang.RuntimeException: Error triggering a checkpoint..." causes task restarting

Reply via email to