Hi all,

We're running a large Flink cluster (256 task managers with 4 task slots each) with High Availability enabled on Kubernetes, and we use Google Cloud Storage (GCS) as the object store for the HA metadata. In addition, our Flink application writes to GCS from one of its sinks via the streaming file sink and the GCS connector.

We observed the following types of errors when running our application:

```
INFO: Encountered status code 429 when sending GET request to URL ' https://storage.googleapis.com/download/storage/v1/b/<redacted>/o/<redacted>checkpoints%2F00000000000000000000000000000000%2Fshared%2F13721c52-18d8-4782-80ab-1ed8a15d9ad5?alt=media&generation=1638448883568946'. Delegating to response handler for possible retry. [CONTEXT ratelimit_period="10 SECONDS [skipped: 8]" ]
```

```
INFO: Encountered status code 503 when sending POST request to URL ' https://storage.googleapis.com/upload/storage/v1/b/<redacted>/o?uploadType=multipart'. Delegating to response handler for possible retry.
```

They typically happen at cluster start-up, when all the task managers are registering with the JobManager. We've also seen them triggered by output from our sink operator.

Has anyone else encountered similar issues? Are there any practices you can suggest?

Advice appreciated!

Thanks