Hi Kevin,

this happens only when the pipeline is started up from a savepoint / retained checkpoint, right? Guessing from the "path" you've shared, it seems like a RocksDB-based retained checkpoint. In this case all task managers need to pull state files from the object storage in order to restore. This can indeed be a heavy operation, especially when restoring a large state with high parallelism.
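For reference, a minimal sketch of the kind of setup I'm assuming on your side, i.e. the RocksDB state backend writing retained (and here incremental) checkpoints to GCS; the bucket path, checkpoint interval and incremental flag are placeholders, not taken from your job:

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB keeps the working state on local disk; checkpoint files are uploaded to GCS.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true)); // true = incremental checkpoints

        env.enableCheckpointing(60_000); // placeholder interval (ms)
        env.getCheckpointConfig().setCheckpointStorage("gs://<bucket>/checkpoints");

        // Keep the last checkpoint around when the job is cancelled, so it can be
        // used to start the pipeline up again later (a "retained checkpoint").
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // ... job graph (sources, operators, sinks) and env.execute() would follow here.
    }
}
```

When such a job is later started with `flink run -s gs://<bucket>/checkpoints/<job-id>/chk-<n> ...`, every TaskManager first has to download its share of those state files from GCS before it can start processing records, which is the heavy part I mentioned above.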
Recovery from failure should be faster (with the DefaultScheduler), as we can re-use the local files that are already present on the TaskManagers (see the config sketch after the quoted message below).

How large is the state you're restoring from, how many TMs does the job use, and what is the parallelism? Also, things can get even worse if the parallelism that was used for taking the checkpoint is different from the one you're trying to restore with (especially with RocksDB).

Best,
D.

On Thu, Dec 2, 2021 at 4:29 PM Kevin Lam <kevin....@shopify.com> wrote:

> Hi all,
>
> We're running a large (256 task managers with 4 task slots each) Flink Cluster with High Availability enabled, on Kubernetes, and use Google Cloud Storage (GCS) as our object storage for the HA metadata. In addition, our Flink application writes out to GCS from one of its sinks via the streaming file sink + GCS connector.
>
> We observed the following types of errors when running our application:
>
> ```
> INFO: Encountered status code 429 when sending GET request to URL 'https://storage.googleapis.com/download/storage/v1/b/<redacted>/o/<redacted>checkpoints%2F00000000000000000000000000000000%2Fshared%2F13721c52-18d8-4782-80ab-1ed8a15d9ad5?alt=media&generation=1638448883568946'. Delegating to response handler for possible retry. [CONTEXT ratelimit_period="10 SECONDS [skipped: 8]" ]
> ```
>
> ```
> INFO: Encountered status code 503 when sending POST request to URL 'https://storage.googleapis.com/upload/storage/v1/b/<redacted>/o?uploadType=multipart'. Delegating to response handler for possible retry.
> ```
>
> They typically happen upon cluster start-up, when all the task managers are registering with the jobmanager. We've also seen them occur as a result of output from our sink operator.
>
> Has anyone else encountered similar issues? Any practices you can suggest?
>
> Advice appreciated!
>
> Thanks
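Regarding the local-file re-use I mentioned above: that relies on task-local recovery, which is off by default. Here is a minimal sketch of turning it on via the Java API; in practice you would more commonly set state.backend.local-recovery: true in flink-conf.yaml, so treat this as one possible way to wire it up rather than the recommended one:

```java
import org.apache.flink.configuration.CheckpointingOptions;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LocalRecoverySketch {

    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Keep a secondary copy of each task's checkpointed state on local disk so that,
        // after a failure, tasks re-deployed onto the same TaskManager can restore from
        // it instead of re-downloading every file from GCS.
        conf.setBoolean(CheckpointingOptions.LOCAL_RECOVERY, true);

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);

        // ... the rest of the job definition stays the same.
        // Note: this only speeds up failover recovery; the initial restore from a
        // savepoint / retained checkpoint still downloads everything from GCS.
    }
}
```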