Hello Dylan,

I'm not an expert.

There are many configuration settings(tuning) which could be setup via
flink configuration. Pls refer to the second link below - specifically
retry options.

https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/deployment/filesystems/gcs/
https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/v2.2.18/gcs/CONFIGURATION.md

Thanks-
-A

On Tue, Apr 2, 2024 at 1:13 PM Dylan Fontana via user <user@flink.apache.org>
wrote:

> Hey Flink Users,
>
> We've been facing an issue with GCS that I'm keen to hear the community's
> thoughts or insights on.
>
> We're using the GCS FileSystem on a FileSink to write parquets in our
> Flink app. We're finding sporadic instances of
> `com.google.cloud.storage.StorageException: Read timed out` that cause our
> job to restart. While we have tolerance in place for failed checkpoints,
> this causes many more failures/restarts as compared to other FileSystems
> like AWS or Azure we use. We've tried tuning the size of the files we write
> but found no improvement; our parquets are already "tiny" - _many_ parquets
> on the order of 1-10KB. Following multiple stack traces, we see the
> exception raised from multiple parts of the sink lifecycle:
> FileWriter::prepareCommit, FileWriter::write, and FileCommitter::commit.
>
> Our hypothesis is sporadic failures from GCS HTTP APIs that aren't getting
> retried correctly or need a longer timeout than the default (20 seconds for
> Read timeouts, 50 seconds for Retries overall). This problem is infrequent
> enough that it's hard to reproduce/test; it comes and goes on how noisy it
> is.
>
> I noticed we can't tune any google-cloud-storage parameters via
> flink-config; there's FLINK-32877[1] which proposed adding Read/Connection
> Timeout parameters for the HTTPTransportOptions[2] but it's still open. I
> also noticed there's more we can change like what gets retried in the
> StorageRetryStrategy[3] and the RetrySettings[4]. Ultimately I'm thinking
> of creating an alternate FileSystemFactory in our deployment (under a
> different scheme/plugin) to test how tweaking these options in the
> StorageOptions.Builder[5] call works out.
>
> Have other GCS FileSink users hit these exceptions? What did you do?
> Anything else we might need to consider?
>
> -Dylan
>
>
> [1]: https://issues.apache.org/jira/browse/FLINK-32877
> [2]:
> https://cloud.google.com/java/docs/reference/google-cloud-core/latest/com.google.cloud.http.HttpTransportOptions
> [3]:
> https://cloud.google.com/java/docs/reference/google-cloud-storage/latest/com.google.cloud.storage.StorageRetryStrategy
> [4]:
> https://github.com/googleapis/sdk-platform-java/blob/a94c2f0e8a99f0ddf17106cbc8117cefe6b0e127/java-core/google-cloud-core/src/main/java/com/google/cloud/ServiceOptions.java#L787
> [5]:
> https://github.com/apache/flink/blob/163b9cca6d2ccac0ff89dd985e3232667ddfb14f/flink-filesystems/flink-gs-fs-hadoop/src/main/java/org/apache/flink/fs/gs/GSFileSystemFactory.java#L94
>

Reply via email to