Hi all,

We're running a large Flink cluster (256 task managers with 4 task slots
each) on Kubernetes with high availability enabled, using Google Cloud
Storage (GCS) as the object store for the HA metadata. In addition, one of
our application's sinks writes to GCS via the streaming file sink and the
GCS connector.

We observed the following types of errors when running our application:

```
INFO: Encountered status code 429 when sending GET request to URL 'https://storage.googleapis.com/download/storage/v1/b/<redacted>/o/<redacted>checkpoints%2F00000000000000000000000000000000%2Fshared%2F13721c52-18d8-4782-80ab-1ed8a15d9ad5?alt=media&generation=1638448883568946'.
Delegating to response handler for possible retry. [CONTEXT ratelimit_period="10 SECONDS [skipped: 8]" ]
```

```
INFO: Encountered status code 503 when sending POST request to URL 'https://storage.googleapis.com/upload/storage/v1/b/<redacted>/o?uploadType=multipart'.
Delegating to response handler for possible retry.
```

These errors typically occur at cluster start-up, when all the task
managers are registering with the job manager. We have also seen them
triggered by writes from our sink operator.
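
For reference, the retry pattern we understand to be the usual
recommendation for 429 and 503 responses is truncated exponential backoff
with jitter. Below is a minimal sketch of that idea only. It does not
reflect how the GCS connector retries internally, and the names
`backoff_delays`, `call_with_backoff`, and `send_request`, as well as the
default retry count and delays, are hypothetical and chosen for
illustration.

```python
import random

# Status codes from the logs above that are treated as retryable here.
RETRYABLE_STATUS = {429, 503}


def backoff_delays(max_retries=5, base_seconds=1.0, cap_seconds=32.0, rng=None):
    """Yield one wait time per retry (truncated exponential backoff, full jitter)."""
    rng = rng or random.Random()
    for attempt in range(max_retries):
        # The window doubles on each attempt and is truncated at cap_seconds.
        window = min(cap_seconds, base_seconds * (2 ** attempt))
        # Full jitter: pick uniformly in [0, window] so that many clients
        # starting at the same time do not retry in lockstep.
        yield rng.uniform(0.0, window)


def call_with_backoff(send_request, sleep, **backoff_kwargs):
    """Call send_request() until it returns a non-retryable status or retries run out.

    send_request: hypothetical callable returning an HTTP status code (int).
    sleep: callable taking seconds, e.g. time.sleep (injected so it can be stubbed).
    """
    status = send_request()
    for delay in backoff_delays(**backoff_kwargs):
        if status not in RETRYABLE_STATUS:
            break
        sleep(delay)
        status = send_request()
    return status
```

With 256 task managers starting together, the jitter is the part that
matters most to us: without it, every client would retry on the same
schedule and hit the bucket in synchronized waves.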

Has anyone else encountered similar issues? Are there any best practices
you can suggest for avoiding or mitigating them?

Advice appreciated!

Thanks
