Not sure if you've seen this, but Flink's file systems do support connection
limiting.
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/common/#connection-limiting
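For example, you could put something like the following in flink-conf.yaml
(the gs:// scheme and the numbers here are only illustrative, tune them for
your own setup):

  fs.gs.limit.total: 50            # max concurrent open streams for gs://
  fs.gs.limit.input: 25            # max concurrent input streams
  fs.gs.limit.output: 25           # max concurrent output streams
  fs.gs.limit.timeout: 30000       # ms to wait for a free slot before failing
  fs.gs.limit.stream-timeout: 60000  # ms an inactive stream may hold its slot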
Seth
On Wed, Dec 8, 2021 at 12:18 PM Kevin Lam wrote:
Hey David,
Thanks for the response. The retry eventually succeeds, but I was wondering
if there was anything that people in the community have done to avoid
GCS/S3 rate-limiting issues. The retries do result in it taking longer for
all the task managers to recover and register.
On Mon, Dec 6, 202
Hi Kevin,
Flink comes with two schedulers for streaming:
- Default
- Adaptive (opt-in)
The Adaptive scheduler is still in an experimental phase and doesn't support local recovery.
You're most likely using the first one, so you should be OK.
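If it helps: the adaptive scheduler is only used if you explicitly opt in via
flink-conf.yaml, for example:

  # opt in to the (experimental) adaptive scheduler;
  # omit this setting to keep the default scheduler
  jobmanager.scheduler: adaptive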
Can you elaborate on this a bit? We aren't changing the parallelism when
Hi David,
Thanks for your response.
What's the DefaultScheduler you're referring to? Is that available in Flink
1.13.1 (the version we are using)?
> How large is the state you're restoring from / how many TMs does the job
> consume / what is the parallelism?
Our checkpoint is about 900GB, and we
Hi Kevin,
this happens only when the pipeline is started up from savepoint / retained
checkpoint right? Guessing from the "path" you've shared it seems like a
RocksDB-based retained checkpoint. In this case all task managers need to
pull state files from the object storage in order to restore. This
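For context, by "retained checkpoint" I mean a setup roughly along these lines
(the path and values are placeholders, not your actual config):

  state.backend: rocksdb
  state.checkpoints.dir: gs://<bucket>/checkpoints
  execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION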
Hi all,
We're running a large (256 task managers with 4 task slots each) Flink
cluster with High Availability enabled on Kubernetes, and use Google Cloud
Storage (GCS) as our object storage for the HA metadata. In addition, our
Flink application writes out to GCS from one of its sinks via streami