Xi Zhang created FLINK-38225:
--------------------------------
Summary: GSBlobStorageImpl needs to add precondition checks to make 503 errors retryable
Key: FLINK-38225
URL: https://issues.apache.org/jira/browse/FLINK-38225
Project: Flink
Issue Type: Improvement
Components: Connectors / FileSystem
Affects Versions: 1.20.0, 1.19.0, 1.18.0, 2.0.1
Reporter: Xi Zhang
We discovered that storage.writer and compose do not retry 503 errors. This
happens quite frequently and causes checkpoint failures.
Here are our investigation findings:
* We noticed frequent checkpoint failures due to GCS 503 errors; tuning the
retry settings didn't seem to take effect.
* We upgraded the storage client to 2.52.1 and built a custom image. The new
version revealed a more detailed error:
** Suppressed: com.google.cloud.storage.RetryContext$RetryBudgetExhaustedComment: Unretryable error (attempts: 1, maxAttempts: 12, elapsed: PT0.683S, nextBackoff: PT1.173526768S, timeout: PT2M30S)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 503 Service Unavailable
* Upon checking with Google, we learned that we need to set preconditions in
write and compose; otherwise the 503 error is deemed not retryable (without a
precondition the operation is not idempotent, so the client will not retry it).
* The fix is as below
{code:java}
Storage.BlobTargetOption precondition;
if (storage.get(targetBlobIdentifier.bucketName, targetBlobIdentifier.objectName) == null) {
    // For a target object that does not yet exist, set the DoesNotExist precondition.
    // This will cause the request to fail if the object is created before the request runs.
    precondition = Storage.BlobTargetOption.doesNotExist();
} else {
    // If the destination already exists in your bucket, instead set a generation-match
    // precondition. This will cause the request to fail if the existing object's
    // generation changes before the request runs.
    precondition =
            Storage.BlobTargetOption.generationMatch(
                    storage.get(targetBlobIdentifier.bucketName, targetBlobIdentifier.objectName)
                            .getGeneration());
}
Storage.ComposeRequest request = builder.setTargetOptions(precondition).build();
storage.compose(request);
{code}
The precondition check needs to be added in compose:
[https://github.com/apache/flink/blob/master/flink-filesystems/flink-gs-fs-hadoop/src/main/java/org/apache/flink/fs/gs/storage/GSBlobStorageImpl.java#L157]
* A similar fix applies to write as well (see the sketch after this list):
[https://github.com/apache/flink/blob/master/flink-filesystems/flink-gs-fs-hadoop/src/main/java/org/apache/flink/fs/gs/storage/GSBlobStorageImpl.java#L64]
* We don't see checkpoint failures after the fix.
* We'd like to apply the fix to the open-source code, since we don't want our
image to diverge too much from the open-source version. I can help contribute
if needed.
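Below is a minimal sketch of what the write-path change could look like, following the same precondition pattern as the compose fix above. It assumes a google-cloud-storage 2.x client; the openWriter helper and its parameters are illustrative and do not match the actual GSBlobStorageImpl signatures.
{code:java}
import com.google.cloud.WriteChannel;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;

// Illustrative helper; in GSBlobStorageImpl.writeBlob the option would be passed
// to the existing storage.writer(...) call.
static WriteChannel openWriter(Storage storage, String bucketName, String objectName) {
    BlobInfo blobInfo = BlobInfo.newBuilder(BlobId.of(bucketName, objectName)).build();

    Blob existing = storage.get(bucketName, objectName);
    Storage.BlobWriteOption precondition;
    if (existing == null) {
        // The target object does not exist yet: fail the upload if it is created
        // before the request runs.
        precondition = Storage.BlobWriteOption.doesNotExist();
    } else {
        // The target object exists: fail the upload if its generation changes
        // before the request runs.
        precondition = Storage.BlobWriteOption.generationMatch(existing.getGeneration());
    }

    // With a precondition the upload is conditionally idempotent, so the client
    // will retry transient errors such as 503.
    return storage.writer(blobInfo, precondition);
}
{code}
Note the extra storage.get lookup per write/compose; that metadata round trip is the cost of making the operation conditionally idempotent and therefore retryable.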