Hello, I'm seeing the following behavior in StreamingFileSink (1.9.1) uploading to S3.
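For context, the sinks are attached to the Kafka source roughly like the sketch below. This is not the exact job code: the bucket name, encoder, and row-format choice are placeholders, and the real job uses a custom BucketAssigner that produces the dt=.../partition_N layout seen in the errors.

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class S3SinkSketch {

    // Roughly how s3_data_sink / s3_metadata_sink are attached to the Kafka
    // source stream. The s3:// path is handled by the Flink S3 filesystem
    // (the RecoverableMultiPartUploadImpl in the stack trace comes from there).
    // Bucket name and encoder are placeholders, not the real values.
    public static void attachSink(DataStream<String> records) {
        StreamingFileSink<String> sink = StreamingFileSink
                .forRowFormat(
                        new Path("s3://<bucket>/tmp/kafka/data"),
                        new SimpleStringEncoder<String>("UTF-8"))
                // A custom BucketAssigner produces the
                // dt=2019-11-06T15/partition_N buckets seen below; omitted here.
                .build();

        records.addSink(sink).name("s3_data_sink");
    }
}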
2019-11-06 15:50:58,081 INFO  com.quora.dataInfra.s3connector.flink.filesystem.Buckets - Subtask 1 checkpointing for checkpoint with id=5025 (max part counter=3406).
2019-11-06 15:50:58,448 INFO  org.apache.flink.streaming.api.operators.AbstractStreamOperator - Could not complete snapshot 5025 for operator Source: kafka_source -> (Sink: s3_metadata_sink, Sink: s3_data_sink) (2/18).
java.io.IOException: Uploading parts failed
    at org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.awaitPendingPartUploadToComplete(RecoverableMultiPartUploadImpl.java:231)
    at org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.awaitPendingPartsUpload(RecoverableMultiPartUploadImpl.java:215)
    at org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.snapshotAndGetRecoverable(RecoverableMultiPartUploadImpl.java:151)
    at org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.snapshotAndGetRecoverable(RecoverableMultiPartUploadImpl.java:56)
    ... 12 more
Caused by: java.io.FileNotFoundException: upload part on tmp/kafka/meta/auction_ads/dt=2019-11-06T15/partition_7/part-1-3403:
org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The specified upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchUpload; Request ID: 6D4B335FE7687B51; S3 Extended Request ID: OOqtRkyz1O4hA+Gfn+kRyZS/XSzD5WHlQZZbU/+OIO/9paITpCJmdKFqws1dDy/d/e4EXedrVNc=), S3 Extended Request ID: OOqtRkyz1O4hA+Gfn+kRyZS/XSzD5WHlQZZbU/+OIO/9paITpCJmdKFqws1dDy/d/e4EXedrVNc=:NoSuchUpload
    ... 10 more
...
2019-11-06 15:50:58,476 INFO  org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task Source: kafka_source -> (Sink: s3_metadata_sink, Sink: s3_data_sink) (2/18) (060d4deed87f3be96f3704474a5dc3e9).

Via the S3 console, the finished file in question (part-1-3403) does NOT exist, but its in-progress temp file does:

_part-1-3402_tmp_38cbdecf-e5b5-4649-9754-bb7aa008f373
_part-1-3403_tmp_73e2a73b-0bac-46e8-8fdf-9455903d9da0
part-1-3395
part-1-3396
...
part-1-3401

The bucket's MPU lifecycle policy is configured to delete incomplete uploads after 3 days, so it should not be a factor here (the rule is sketched at the end of this message for reference).

Attempting to restore from the most recent checkpoint, 5025, results in similar failures for other topics. What I see in S3 is essentially the same pattern of two incomplete part files, for example:

_part-4-3441_tmp_da13ceba-a284-4353-bdd6-ef4005d382fc
_part-4-3442_tmp_fe0c0e00-c7f7-462f-a99f-464b2851a4cb

and the checkpoint restore fails with:

upload part on tmp/kafka/meta/feed_features/dt=2019-11-06T15/partition_0/part-4-3441: org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The specified upload does not exist.

(The upload does indeed not exist in S3.)

Any ideas? As it stands, this job is essentially unrecoverable because of this error.

Thank you
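P.S. For reference, the MPU lifecycle rule mentioned above is roughly the following, written in the JSON form that aws s3api put-bucket-lifecycle-configuration accepts. The rule ID and empty prefix are placeholders; the 3-day abort of incomplete multipart uploads is the relevant part.

{
  "Rules": [
    {
      "ID": "abort-incomplete-multipart-uploads",
      "Filter": { "Prefix": "" },
      "Status": "Enabled",
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 3 }
    }
  ]
}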