Hello Dylan, I'm not an expert.
Many tuning settings can be set through the Flink configuration. Please refer to the second link below, specifically the retry options.

https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/deployment/filesystems/gcs/
https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/v2.2.18/gcs/CONFIGURATION.md

Thanks,
-A

On Tue, Apr 2, 2024 at 1:13 PM Dylan Fontana via user <user@flink.apache.org> wrote:

> Hey Flink Users,
>
> We've been facing an issue with GCS that I'm keen to hear the community's
> thoughts or insights on.
>
> We're using the GCS FileSystem on a FileSink to write parquets in our
> Flink app. We're finding sporadic instances of
> `com.google.cloud.storage.StorageException: Read timed out` that cause our
> job to restart. While we have tolerance in place for failed checkpoints,
> this causes many more failures/restarts as compared to other FileSystems
> like AWS or Azure we use. We've tried tuning the size of the files we write
> but found no improvement; our parquets are already "tiny" - _many_ parquets
> on the order of 1-10KB. Following multiple stack traces, we see the
> exception raised from multiple parts of the sink lifecycle:
> FileWriter::prepareCommit, FileWriter::write, and FileCommitter::commit.
>
> Our hypothesis is sporadic failures from GCS HTTP APIs that aren't getting
> retried correctly or need a longer timeout than the default (20 seconds for
> Read timeouts, 50 seconds for Retries overall). This problem is infrequent
> enough that it's hard to reproduce/test; it comes and goes on how noisy it
> is.
>
> I noticed we can't tune any google-cloud-storage parameters via
> flink-config; there's FLINK-32877[1] which proposed adding Read/Connection
> Timeout parameters for the HTTPTransportOptions[2] but it's still open. I
> also noticed there's more we can change like what gets retried in the
> StorageRetryStrategy[3] and the RetrySettings[4]. Ultimately I'm thinking
> of creating an alternate FileSystemFactory in our deployment (under a
> different scheme/plugin) to test how tweaking these options in the
> StorageOptions.Builder[5] call works out.
>
> Have other GCS FileSink users hit these exceptions? What did you do?
> Anything else we might need to consider?
>
> -Dylan
>
>
> [1]: https://issues.apache.org/jira/browse/FLINK-32877
> [2]: https://cloud.google.com/java/docs/reference/google-cloud-core/latest/com.google.cloud.http.HttpTransportOptions
> [3]: https://cloud.google.com/java/docs/reference/google-cloud-storage/latest/com.google.cloud.storage.StorageRetryStrategy
> [4]: https://github.com/googleapis/sdk-platform-java/blob/a94c2f0e8a99f0ddf17106cbc8117cefe6b0e127/java-core/google-cloud-core/src/main/java/com/google/cloud/ServiceOptions.java#L787
> [5]: https://github.com/apache/flink/blob/163b9cca6d2ccac0ff89dd985e3232667ddfb14f/flink-filesystems/flink-gs-fs-hadoop/src/main/java/org/apache/flink/fs/gs/GSFileSystemFactory.java#L94
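
For reference, here is a rough sketch of how the timeout and retry options from the connector's CONFIGURATION.md might look in the Flink configuration. It assumes, as I understand the Flink GCS page, that the connector's `fs.gs.*` keys can be set in flink-conf.yaml by dropping the leading `fs.`. The values are placeholders, not recommendations. As Dylan notes, google-cloud-storage parameters cannot currently be tuned via flink-config, so these connector settings may not reach the FileSink write path at all; treat this as something to try rather than a confirmed fix.

```yaml
# flink-conf.yaml (illustrative values only)
# Assumption: gcs-connector keys (fs.gs.http.*) are passed through when
# written with the "gs." prefix; verify against the Flink GCS docs.

# Read timeout for GCS HTTP requests, in milliseconds
gs.http.read-timeout: 60000

# Connect timeout for GCS HTTP requests, in milliseconds
gs.http.connect-timeout: 30000

# Maximum number of retries for failed GCS HTTP requests
gs.http.max.retry: 15
```

If the "Read timed out" restarts continue with these set, that would be a hint that the exception comes from the google-cloud-storage client path, where the alternate FileSystemFactory idea would be the next thing to test.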