Hi Alex,

Please see my comment here [1].
[1] https://lists.apache.org/thread/h5mv6ld4l2g4hsjszfdos9f365nh7ctf

BR,
G

On Mon, Sep 2, 2024 at 11:02 AM Alex K. <flink.user...@gmail.com> wrote:

> We have an issue where a savepoint file containing Kafka topic partition
> offsets is requested millions of times from AWS S3. This causes the job to
> crash, restart, and crash again. We have tracked the high number of reads
> (~3 million) to the number of Kafka topic partitions (~40k) multiplied by
> the job parallelism (70 slots). We are using Flink 1.19.0 with KafkaSource,
> and savepoints/checkpoints are stored in AWS S3.
>
> We increased state.storage.fs.memory-threshold to 700kb, which results in
> the Kafka topic partition offsets being written into the _metadata
> savepoint file and effectively eliminates the problem above. Our topics
> and partitions grow weekly, so we will soon reach the
> state.storage.fs.memory-threshold maximum value of 1 MB.
>
> Is this behaviour expected, and if so, could it be optimised by reducing
> the high number of reads, by caching the file, or by some other
> configuration we are not aware of?
>
> Thank you
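
For reference, a minimal sketch of the workaround described above, i.e. raising state.storage.fs.memory-threshold so that small state handles (such as the Kafka partition offsets) are inlined into the _metadata file rather than read as separate objects from S3. This assumes the Flink 1.19 Java API and simply mirrors the values mentioned in the thread; in production the same key would normally go into flink-conf.yaml instead:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MemoryThresholdSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // State smaller than this threshold is stored inline in the _metadata
        // file; the configured maximum for this option is 1 MB.
        conf.setString("state.storage.fs.memory-threshold", "700kb");

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);

        // ... build the job (KafkaSource, sinks, etc.) against env as usual ...

        env.execute("memory-threshold-sketch");
    }
}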