[ https://issues.apache.org/jira/browse/FLINK-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668610#comment-16668610 ]
Pawel Bartoszek commented on FLINK-10664:
-----------------------------------------

[~StephanEwen] I looked into the flink-s3-fs-presto source code and saw that org.apache.flink.fs.s3presto.S3FileSystemFactory extends AbstractS3FileSystemFactory ([https://github.com/apache/flink/blob/master/flink-filesystems/flink-s3-fs-base/src/main/java/org/apache/flink/fs/s3/common/AbstractS3FileSystemFactory.java]), which internally uses Hadoop classes. How is this different from using the Hadoop FS?

> Flink: Checkpointing fails with S3 exception - Please reduce your request rate
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-10664
>                 URL: https://issues.apache.org/jira/browse/FLINK-10664
>             Project: Flink
>          Issue Type: Improvement
>          Components: JobManager, TaskManager
>    Affects Versions: 1.5.4, 1.6.1
>            Reporter: Pawel Bartoszek
>            Priority: Major
>
> When a checkpoint is created for a job with many operators, Flink can upload too many checkpoint files to S3 at the same time, which results in throttling from S3.
>
> {code:java}
> Caused by: org.apache.hadoop.fs.s3a.AWSS3IOException: saving output on flink/state-checkpoints/7bbd6495f90257e4bc037ecc08ba21a5/chk-19/4422b088-0836-4f12-bbbe-7e731da11231: com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: XXXX; S3 Extended Request ID: XXX), S3 Extended Request ID: XXX: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: 5310EA750DF8B949; S3 Extended Request ID: XXX)
> at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:178)
> at org.apache.hadoop.fs.s3a.S3AOutputStream.close(S3AOutputStream.java:121)
> at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:74)
> at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:108)
> at org.apache.flink.runtime.fs.hdfs.HadoopDataOutputStream.close(HadoopDataOutputStream.java:52)
> at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)
> at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:311){code}
>
> Can the upload be retried with some kind of back-off?
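For illustration, a minimal sketch of the kind of back-off retry being asked for above. The RetriableUpload interface, the helper class, and the attempt/delay limits are hypothetical; this is not the Flink or Hadoop S3 code path, only an assumption of how a retried upload with exponential back-off and jitter could look:

{code:java}
import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;

/**
 * Hypothetical sketch: retry a failing checkpoint-file upload with
 * exponential back-off and jitter instead of failing the checkpoint
 * on the first 503/SlowDown response from S3.
 */
public final class BackoffRetry {

    /** Illustrative stand-in for the close/upload call that may fail. */
    @FunctionalInterface
    public interface RetriableUpload {
        void run() throws IOException;
    }

    public static void runWithBackoff(RetriableUpload upload, int maxAttempts, long baseDelayMs)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                upload.run();
                return; // upload succeeded
            } catch (IOException e) {
                lastFailure = e;
                if (attempt == maxAttempts) {
                    break;
                }
                // exponential back-off (doubling per attempt) plus random
                // jitter so that many parallel tasks do not retry in lockstep
                long delay = baseDelayMs * (1L << (attempt - 1));
                long jitter = ThreadLocalRandom.current().nextLong(baseDelayMs);
                Thread.sleep(delay + jitter);
            }
        }
        throw lastFailure;
    }
}
{code}

With something along these lines, a transient SlowDown response would be retried a few times with increasing, jittered delays before the checkpoint is declared failed.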