[ https://issues.apache.org/jira/browse/FLINK-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Flink Jira Bot updated FLINK-10664:
-----------------------------------
    Labels: auto-deprioritized-major stale-minor  (was: auto-deprioritized-major)

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help the community manage its development. I see this issue has been marked as Minor but is unassigned and neither itself nor its Sub-Tasks have been updated for 180 days. I have gone ahead and marked it "stale-minor". If this ticket is still Minor, please either assign yourself or give an update. Afterwards, please remove the label or in 7 days the issue will be deprioritized.

> Flink: Checkpointing fails with S3 exception - Please reduce your request rate
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-10664
>                 URL: https://issues.apache.org/jira/browse/FLINK-10664
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.5.4, 1.6.1
>            Reporter: Pawel Bartoszek
>            Priority: Minor
>              Labels: auto-deprioritized-major, stale-minor
>
> When a checkpoint is created for a job that has many operators, Flink may upload too many checkpoint files to S3 at the same time, which results in throttling from S3.
>
> {code:java}
> Caused by: org.apache.hadoop.fs.s3a.AWSS3IOException: saving output on flink/state-checkpoints/7bbd6495f90257e4bc037ecc08ba21a5/chk-19/4422b088-0836-4f12-bbbe-7e731da11231: com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: XXXX; S3 Extended Request ID: XXX), S3 Extended Request ID: XXX: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: 5310EA750DF8B949; S3 Extended Request ID: XXX)
>     at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:178)
>     at org.apache.hadoop.fs.s3a.S3AOutputStream.close(S3AOutputStream.java:121)
>     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:74)
>     at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:108)
>     at org.apache.flink.runtime.fs.hdfs.HadoopDataOutputStream.close(HadoopDataOutputStream.java:52)
>     at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)
>     at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:311)
> {code}
>
> Can the upload be retried with some kind of backoff?
>

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
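
On the question in the quoted issue about retrying the upload with backoff: the following is a minimal sketch of what such a retry could look like, not how Flink or the Hadoop S3A connector implements it. The class `BackoffRetry` and its methods `withBackoff` and `delayCapMs` are hypothetical names introduced for illustration. The sketch assumes that throttling surfaces as an `IOException` (as the `AWSS3IOException` in the trace appears to) and that the action can safely be re-run from the start, meaning the checkpoint data would have to be written again rather than only calling `close()` a second time. It uses capped exponential backoff with random jitter so that many operators throttled at the same moment do not all retry at the same moment.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper for illustration; not part of Flink or Hadoop. */
final class BackoffRetry {

    private BackoffRetry() {}

    /** Upper bound of the sleep before the retry that follows failed attempt {@code attempt} (1-based). */
    static long delayCapMs(int attempt, long baseDelayMs, long maxDelayMs) {
        long delay = baseDelayMs;
        // Double the delay once per earlier attempt, stopping at the cap.
        for (int i = 1; i < attempt && delay < maxDelayMs; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxDelayMs);
    }

    /**
     * Runs {@code action}, retrying on IOException up to {@code maxAttempts} total attempts.
     * Assumption: the action is idempotent and restarts the upload from the beginning.
     */
    static <T> T withBackoff(Callable<T> action, int maxAttempts, long baseDelayMs, long maxDelayMs)
            throws Exception {
        if (maxAttempts < 1 || baseDelayMs < 0 || maxDelayMs < 0) {
            throw new IllegalArgumentException("maxAttempts must be >= 1 and delays must be >= 0");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: surface the last failure
                }
                long cap = delayCapMs(attempt, baseDelayMs, maxDelayMs);
                // Jitter: sleep a random time in [0, cap] to spread out concurrent retries.
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        // Simulated upload that is throttled twice before it succeeds.
        String result = withBackoff(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new IOException("503 SlowDown (simulated)");
            }
            return "uploaded";
        }, 5, 100, 2_000);
        System.out.println(result + " after " + calls.get() + " attempts");
    }
}
```

A real implementation would likely retry only on the throttling error (status code 503, error code SlowDown) rather than on every `IOException`, and would need to fit within the checkpoint timeout.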