We have an issue where a savepoint file containing Kafka topic partition offsets is requested millions of times from AWS S3. This causes the job to crash, restart, and crash again in a loop. We have traced the high number of reads (~3 million) to the number of Kafka topic partitions (~40k) multiplied by the job parallelism (70 slots): 40,000 × 70 ≈ 2.8 million reads. We are using Flink 1.19.0 with KafkaSource, and savepoints/checkpoints are stored in AWS S3.
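For context, here is a minimal sketch of the job shape (the bootstrap servers, topic pattern, group id, and S3 path are placeholders, not our actual values):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.regex.Pattern;

public class KafkaOffsetsJobSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Same shape as our job: high parallelism, checkpoints/savepoints in S3.
        env.setParallelism(70);
        env.enableCheckpointing(60_000);
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");

        // KafkaSource subscribing to a large, growing set of topics
        // (~40k partitions in total); the partition offsets end up
        // in the source operator's checkpointed state.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopicPattern(Pattern.compile("events-.*"))
                .setGroupId("offsets-demo")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source").print();
        env.execute("kafka-offsets-sketch");
    }
}
```

On restore, each of the 70 parallel source subtasks reads the offsets state from S3, which is where the ~40k × 70 reads come from.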
We increased state.storage.fs.memory-threshold to 700kb, which causes the Kafka topic partition offsets to be written inline into the _metadata savepoint file and thereby eliminates the problem described above. However, our topics and partitions are growing weekly, so we will soon hit the option's maximum value of 1mb. Is this behaviour expected, and if so, could it be optimised, e.g. by reducing the high number of reads, by caching the file, or through some other configuration we are not aware of? Thank you.
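For reference, the workaround as it could be applied programmatically (a sketch; equivalent to setting the same key in flink-conf.yaml):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MemoryThresholdSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // State handles smaller than this threshold are inlined into the
        // _metadata file instead of being written as (and later read back
        // from) separate S3 objects. The default is 20kb and the value is
        // capped at 1mb, which is the limit we expect to hit.
        conf.setString("state.storage.fs.memory-threshold", "700kb");

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
    }
}
```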