Thanks for the links! We've tried `gs.writer.chunk.size` before and unfortunately found it didn't make a meaningful difference. I think the hadoop-connector link you sent is actually not applicable, since the GCS FileSystem connector isn't using the Hadoop implementation but instead the CloudStorage HTTP client [1].
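Because the connector goes through the google-cloud-storage HTTP client, the knobs in play are that client's per-request read timeout and its overall retry budget (20 seconds and 50 seconds by default, per the original message quoted below). As a rough way to reason about how the two interact, here is a minimal sketch of a simplified model. It is not the library's actual retry algorithm: the class and method names are hypothetical, the backoff values (1 s initial delay, doubling, 32 s cap) are illustrative assumptions, and jitter is ignored.

```java
// Simplified, hypothetical model of a per-attempt timeout inside an overall
// retry budget. NOT the google-cloud-storage implementation; the backoff
// parameters used in main() are illustrative assumptions.
class RetryBudgetModel {

    /**
     * Counts how many attempts are made when every attempt runs into the
     * per-attempt timeout. A retry is scheduled only if the time spent so
     * far plus the next backoff delay still fits inside the total budget.
     */
    static int attemptsWithinBudget(long perAttemptTimeoutMs, long totalBudgetMs,
                                    long initialDelayMs, double multiplier,
                                    long maxDelayMs) {
        if (perAttemptTimeoutMs <= 0 || initialDelayMs < 0 || multiplier < 1.0) {
            throw new IllegalArgumentException("timeouts and backoff must be sensible");
        }
        long elapsedMs = 0;
        long delayMs = initialDelayMs;
        int attempts = 0;
        while (true) {
            attempts++;
            elapsedMs += perAttemptTimeoutMs;       // this attempt times out
            if (elapsedMs + delayMs > totalBudgetMs) {
                return attempts;                    // no budget left for a retry
            }
            elapsedMs += delayMs;                   // wait out the backoff
            delayMs = Math.min((long) (delayMs * multiplier), maxDelayMs);
        }
    }

    public static void main(String[] args) {
        // 20 s read timeout and 50 s overall budget, as quoted in the thread.
        System.out.println(attemptsWithinBudget(20_000, 50_000, 1_000, 2.0, 32_000));
        // Raising only the read timeout to 60 s (hypothetical value).
        System.out.println(attemptsWithinBudget(60_000, 50_000, 1_000, 2.0, 32_000));
        // Raising both the read timeout and the budget (hypothetical values).
        System.out.println(attemptsWithinBudget(60_000, 180_000, 1_000, 2.0, 32_000));
    }
}
```

Under these assumptions, the 20 s / 50 s combination leaves room for three attempts, while raising only the read timeout to 60 s leaves room for a single attempt and no retries. If the real client behaves anything like this model, a longer read timeout and the overall retry budget may need to be tuned together.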
That said, I did take a moment to test the patch from FLINK-32877 [2] and found that increasing the read/connect timeout past the default 20 seconds may be useful. Traces from the HTTP client suggest our restart issue coincides with PUT operations taking longer than 20s, as that ticket also reported. For now, I've decided to focus on getting that ticket moved along if possible, since it seems aligned with our problem. I'm still investigating whether we actually need to customize the retry timeouts from their defaults.

-Dylan

[1]: https://github.com/apache/flink/blob/163b9cca6d2ccac0ff89dd985e3232667ddfb14f/flink-filesystems/flink-gs-fs-hadoop/src/main/java/org/apache/flink/fs/gs/storage/GSBlobStorageImpl.java#L46
[2]: https://issues.apache.org/jira/browse/FLINK-32877

On Tue, Apr 2, 2024 at 6:37 PM Asimansu Bera <asimansu.b...@gmail.com> wrote:

> Hello Dylan,
>
> I'm not an expert.
>
> There are many configuration settings (tuning) which could be set up via
> Flink configuration. Pls refer to the second link below - specifically
> retry options.
>
> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/deployment/filesystems/gcs/
>
> https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/v2.2.18/gcs/CONFIGURATION.md
>
> Thanks-
> -A
>
> On Tue, Apr 2, 2024 at 1:13 PM Dylan Fontana via user <
> user@flink.apache.org> wrote:
>
>> Hey Flink Users,
>>
>> We've been facing an issue with GCS that I'm keen to hear the community's
>> thoughts or insights on.
>>
>> We're using the GCS FileSystem on a FileSink to write parquets in our
>> Flink app. We're finding sporadic instances of
>> `com.google.cloud.storage.StorageException: Read timed out` that cause our
>> job to restart. While we have tolerance in place for failed checkpoints,
>> this causes many more failures/restarts as compared to other FileSystems
>> like AWS or Azure we use.
>> We've tried tuning the size of the files we write
>> but found no improvement; our parquets are already "tiny" - _many_ parquets
>> on the order of 1-10KB. Following multiple stack traces, we see the
>> exception raised from multiple parts of the sink lifecycle:
>> FileWriter::prepareCommit, FileWriter::write, and FileCommitter::commit.
>>
>> Our hypothesis is sporadic failures from GCS HTTP APIs that aren't
>> getting retried correctly or need a longer timeout than the default (20
>> seconds for Read timeouts, 50 seconds for Retries overall). This problem is
>> infrequent enough that it's hard to reproduce/test; it comes and goes on
>> how noisy it is.
>>
>> I noticed we can't tune any google-cloud-storage parameters via
>> flink-config; there's FLINK-32877[1] which proposed adding Read/Connection
>> Timeout parameters for the HTTPTransportOptions[2] but it's still open. I
>> also noticed there's more we can change like what gets retried in the
>> StorageRetryStrategy[3] and the RetrySettings[4]. Ultimately I'm thinking
>> of creating an alternate FileSystemFactory in our deployment (under a
>> different scheme/plugin) to test how tweaking these options in the
>> StorageOptions.Builder[5] call works out.
>>
>> Have other GCS FileSink users hit these exceptions? What did you do?
>> Anything else we might need to consider?
>>
>> -Dylan
>>
>> [1]: https://issues.apache.org/jira/browse/FLINK-32877
>> [2]: https://cloud.google.com/java/docs/reference/google-cloud-core/latest/com.google.cloud.http.HttpTransportOptions
>> [3]: https://cloud.google.com/java/docs/reference/google-cloud-storage/latest/com.google.cloud.storage.StorageRetryStrategy
>> [4]: https://github.com/googleapis/sdk-platform-java/blob/a94c2f0e8a99f0ddf17106cbc8117cefe6b0e127/java-core/google-cloud-core/src/main/java/com/google/cloud/ServiceOptions.java#L787
>> [5]: https://github.com/apache/flink/blob/163b9cca6d2ccac0ff89dd985e3232667ddfb14f/flink-filesystems/flink-gs-fs-hadoop/src/main/java/org/apache/flink/fs/gs/GSFileSystemFactory.java#L94