To add to this, attempting to restore from the most recent manually triggered *savepoint* results in a similar but slightly different error:
java.io.FileNotFoundException: upload part on
*tmp/kafka/meta/ads_action_log_kafka_uncounted/dt=2019-11-06T00/partition_6/part-4-2158*:
org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
The specified upload does not exist. The upload ID may be invalid, or the
upload may have been aborted or completed. (Service: Amazon S3; Status
Code: 404; Error Code: NoSuchUpload

Looking into S3, I see that *two files with the same part number exist:*

_part-4-2158_tmp_03c7ebaa-a9e5-455a-b501-731badc36765
part-4-2158

And again, I cannot recover the job from this prior state.

Thanks so much for any input - would love to understand what is going on.
Happy to provide full logs if needed.

On Wed, Nov 6, 2019 at 11:52 AM Harrison Xu <h...@quora.com> wrote:

> Hello,
> I'm seeing the following behavior in StreamingFileSink (1.9.1) uploading
> to S3.
>
> 2019-11-06 15:50:58,081 INFO  com.quora.dataInfra.s3connector.flink.filesystem.Buckets
>     - *Subtask 1 checkpointing for checkpoint with id=5025 (max part counter=3406).*
> 2019-11-06 15:50:58,448 INFO  org.apache.flink.streaming.api.operators.AbstractStreamOperator
>     - Could not complete snapshot 5025 for operator Source: kafka_source ->
>     (Sink: s3_metadata_sink, Sink: s3_data_sink) (2/18).
> java.io.IOException: Uploading parts failed
>     at org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.awaitPendingPartUploadToComplete(RecoverableMultiPartUploadImpl.java:231)
>     at org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.awaitPendingPartsUpload(RecoverableMultiPartUploadImpl.java:215)
>     at org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.snapshotAndGetRecoverable(RecoverableMultiPartUploadImpl.java:151)
>     at org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.snapshotAndGetRecoverable(RecoverableMultiPartUploadImpl.java:56)
>     ... 12 more
> *Caused by: java.io.FileNotFoundException: upload part on
> tmp/kafka/meta/auction_ads/dt=2019-11-06T15/partition_7/part-1-3403:*
> org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
> The specified upload does not exist. The upload ID may be invalid, or the
> upload may have been aborted or completed. (Service: Amazon S3; Status
> Code: 404; Error Code: NoSuchUpload; Request ID: 6D4B335FE7687B51; S3
> Extended Request ID:
> OOqtRkyz1O4hA+Gfn+kRyZS/XSzD5WHlQZZbU/+OIO/9paITpCJmdKFqws1dDy/d/e4EXedrVNc=),
> S3 Extended Request ID:
> OOqtRkyz1O4hA+Gfn+kRyZS/XSzD5WHlQZZbU/+OIO/9paITpCJmdKFqws1dDy/d/e4EXedrVNc=:NoSuchUpload
>     ... 10 more
> ...
> 2019-11-06 15:50:58,476 INFO  org.apache.flink.runtime.taskmanager.Task
>     - Attempting to cancel task Source: kafka_source ->
>     (Sink: s3_metadata_sink, Sink: s3_data_sink) (2/18)
>     (060d4deed87f3be96f3704474a5dc3e9).
>
> Via the S3 console, the file in question (part-1-3403) does NOT exist, but
> its part file does:
>
> *_part-1-3402_tmp_38cbdecf-e5b5-4649-9754-bb7aa008f373*
> *_part-1-3403_tmp_73e2a73b-0bac-46e8-8fdf-9455903d9da0*
> part-1-3395
> part-1-3396
> ...
> part-1-3401
>
> The MPU lifecycle policy is configured to delete incomplete uploads after *3
> days*, which should not be affecting this.
>
> Attempting to restore from the most recent checkpoint, *5025*, results in
> similar issues for different topics.
> What I am seeing in S3 is essentially two incomplete part files, such as:
>
> *_part-4-3441_tmp_da13ceba-a284-4353-bdd6-ef4005d382fc*
> *_part-4-3442_tmp_fe0c0e00-c7f7-462f-a99f-464b2851a4cb*
>
> And the checkpoint restore operation fails with:
>
> *upload part on
> tmp/kafka/meta/feed_features/dt=2019-11-06T15/partition_0/part-4-3441:
> org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
> The specified upload does not exist.*
> (It does indeed not exist in S3.)
>
> Any ideas?
> As it stands, this job is basically unrecoverable right now because of this error.
> Thank you
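
[For anyone trying to reproduce this, a minimal sketch of the kind of StreamingFileSink (1.9.x) setup described above. The bucket path, encoder, rolling thresholds, and checkpoint interval below are assumptions, not the poster's actual job code; the real job reads from Kafka and writes row-format part files under tmp/kafka/meta/<topic>/... via the S3 recoverable writer (flink-s3-fs-hadoop).]

import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

public class S3SinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // In-progress part files are only committed when a checkpoint completes,
        // so checkpointing must be enabled for the sink to finalize output.
        env.enableCheckpointing(TimeUnit.MINUTES.toMillis(1));

        // Hypothetical output path; the real job writes under
        // tmp/kafka/meta/<topic>/dt=.../partition_N/ with its own bucket assigner.
        StreamingFileSink<String> sink = StreamingFileSink
                .forRowFormat(new Path("s3://my-bucket/tmp/kafka/meta"),
                              new SimpleStringEncoder<String>("UTF-8"))
                .withRollingPolicy(DefaultRollingPolicy.create()
                        .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
                        .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
                        .withMaxPartSize(128L * 1024 * 1024)
                        .build())
                .build();

        // Stand-in bounded source; the real job consumes from Kafka
        // (the "kafka_source" operator in the logs above).
        env.fromElements("record-1", "record-2", "record-3")
           .addSink(sink)
           .name("s3_metadata_sink");

        env.execute("s3-sink-sketch");
    }
}

[The failing restores above would have been triggered in the usual way, e.g. flink run -s <savepoint-or-checkpoint-path> ...; it is at that point that the sink attempts to resume the in-progress multipart uploads recorded in the snapshot, which is where the NoSuchUpload errors surface.]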