Hello,
I'm seeing the following behavior with StreamingFileSink (Flink 1.9.1) uploading to
S3; a rough sketch of how the sink is set up follows the log below.

2019-11-06 15:50:58,081 INFO
 com.quora.dataInfra.s3connector.flink.filesystem.Buckets      - Subtask 1
checkpointing for checkpoint with id=5025 (max part counter=3406).
2019-11-06 15:50:58,448 INFO
 org.apache.flink.streaming.api.operators.AbstractStreamOperator  - Could
not complete snapshot 5025 for operator Source: kafka_source -> (Sink:
s3_metadata_sink, Sink: s3_data_sink) (2/18).
java.io.IOException: Uploading parts failed
at
org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.awaitPendingPartUploadToComplete(RecoverableMultiPartUploadImpl.java:231)
at
org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.awaitPendingPartsUpload(RecoverableMultiPartUploadImpl.java:215)
at
org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.snapshotAndGetRecoverable(RecoverableMultiPartUploadImpl.java:151)
at
org.apache.flink.fs.s3.common.writer.RecoverableMultiPartUploadImpl.snapshotAndGetRecoverable(RecoverableMultiPartUploadImpl.java:56)
...12 more
Caused by: java.io.FileNotFoundException: upload part on
tmp/kafka/meta/auction_ads/dt=2019-11-06T15/partition_7/part-1-3403:
org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
The specified upload does not exist. The upload ID may be invalid, or the
upload may have been aborted or completed. (Service: Amazon S3; Status
Code: 404; Error Code: NoSuchUpload; Request ID: 6D4B335FE7687B51; S3
Extended Request ID:
OOqtRkyz1O4hA+Gfn+kRyZS/XSzD5WHlQZZbU/+OIO/9paITpCJmdKFqws1dDy/d/e4EXedrVNc=),
S3 Extended Request ID:
OOqtRkyz1O4hA+Gfn+kRyZS/XSzD5WHlQZZbU/+OIO/9paITpCJmdKFqws1dDy/d/e4EXedrVNc=:NoSuchUpload
... 10 more
...
2019-11-06 15:50:58,476 INFO  org.apache.flink.runtime.taskmanager.Task
                - Attempting to cancel task Source: kafka_source -> (Sink:
s3_metadata_sink, Sink: s3_data_sink) (2/18)
(060d4deed87f3be96f3704474a5dc3e9).
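
For context, the sink is wired up along these lines. This is only a minimal
sketch: the bucket name, the source, and the DateTimeBucketAssigner pattern are
placeholders (as the logger name above shows, the job actually uses a forked
Buckets/filesystem package), but the relevant part is a row-format
StreamingFileSink whose multipart uploads are driven by checkpoints.

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;

public class S3SinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpoints are what trigger the part uploads / commits of the S3 MPUs.
        env.enableCheckpointing(60_000);

        // Placeholder source; the real job consumes from Kafka (kafka_source).
        DataStream<String> stream = env.fromElements("record-1", "record-2");

        // Row-format sink writing to an s3:// path; the hourly "dt=..." bucketing
        // is approximated here with a DateTimeBucketAssigner.
        StreamingFileSink<String> s3DataSink = StreamingFileSink
                .forRowFormat(new Path("s3://some-bucket/tmp/kafka/meta"),
                        new SimpleStringEncoder<String>("UTF-8"))
                .withBucketAssigner(new DateTimeBucketAssigner<>("'dt='yyyy-MM-dd'T'HH"))
                .build();

        stream.addSink(s3DataSink).name("s3_data_sink");
        env.execute("s3-sink-sketch");
    }
}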

Via the S3 console, the finalized file in question (part-1-3403) does NOT
exist, but its in-progress (_tmp_) part file does:

_part-1-3402_tmp_38cbdecf-e5b5-4649-9754-bb7aa008f373
_part-1-3403_tmp_73e2a73b-0bac-46e8-8fdf-9455903d9da0
part-1-3395
part-1-3396
...
part-1-3401

The bucket's MPU lifecycle policy is configured to abort incomplete multipart
uploads after 3 days, which should not be affecting this.
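
For reference, that rule is the standard S3 AbortIncompleteMultipartUpload
lifecycle action. Expressed with the AWS Java SDK v1 (the same SDK that appears
shaded in the stack trace above) it would look roughly like the sketch below;
the bucket name, rule id, and prefix are placeholders, not our actual
configuration.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.AbortIncompleteMultipartUpload;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration;
import com.amazonaws.services.s3.model.lifecycle.LifecycleFilter;
import com.amazonaws.services.s3.model.lifecycle.LifecyclePrefixPredicate;

public class AbortIncompleteMpuRule {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Abort multipart uploads that have been open for more than 3 days.
        // An upload belonging to a checkpoint that is only minutes old should
        // never be touched by this rule.
        BucketLifecycleConfiguration.Rule rule = new BucketLifecycleConfiguration.Rule()
                .withId("abort-incomplete-mpu-after-3-days")
                .withFilter(new LifecycleFilter(new LifecyclePrefixPredicate("tmp/")))
                .withAbortIncompleteMultipartUpload(
                        new AbortIncompleteMultipartUpload().withDaysAfterInitiation(3))
                .withStatus(BucketLifecycleConfiguration.ENABLED);

        s3.setBucketLifecycleConfiguration("some-bucket",
                new BucketLifecycleConfiguration().withRules(rule));
    }
}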

Attempting to restore from the most recent checkpoint (5025) results in
similar issues for other topics. What I am seeing in S3 is essentially a pair
of in-progress part files, such as:

_part-4-3441_tmp_da13ceba-a284-4353-bdd6-ef4005d382fc
_part-4-3442_tmp_fe0c0e00-c7f7-462f-a99f-464b2851a4cb

And the checkpoint restore operation fails with:

upload part on
tmp/kafka/meta/feed_features/dt=2019-11-06T15/partition_0/part-4-3441:
org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
The specified upload does not exist.
(And indeed, that upload does not exist in S3.)

Any ideas?
As it stands, this job is essentially unrecoverable because of this error.
Thank you
