I'm not very knowledgeable when it comes to Linux memory management, but do note that Linux (and by extension Kubernetes) counts page cache generated by disk IO toward a cgroup's memory usage when deciding whether a process has exceeded its limit; see e.g. https://faun.pub/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d
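As a concrete illustration of that accounting, the commands below are a diagnostic sketch, assuming a cgroup v1 memory controller mounted at the conventional path inside the container (cgroup v2 exposes a similar `memory.stat`); they show how page cache counts toward the container's limit alongside RSS:

```shell
# Inside the TaskManager container: inspect the memory cgroup's accounting.
# "cache" is page cache populated by disk IO (e.g. RocksDB SST file reads);
# the kernel charges it to the cgroup, and if reclaim cannot keep total usage
# under the limit, the OOM killer fires even though heap/RSS alone looks fine.
grep -E '^(cache|rss) ' /sys/fs/cgroup/memory/memory.stat
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
```

Comparing `memory.usage_in_bytes` against the JVM's own metrics indicates how much of the gap is page cache rather than heap.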
Regards,
Alexis

________________________________
From: Guowei Ma <guowei....@gmail.com>
Sent: Monday, September 13, 2021 8:35 AM
To: Kevin Lam <kevin....@shopify.com>
Cc: user <user@flink.apache.org>
Subject: Re: TaskManagers OOM'ing for Flink App with very large state only when restoring from checkpoint

Hi, Kevin

1. Could you share some specifics, such as which version of Flink you are using, and whether the job uses the DataStream API or SQL?
2. As far as I know, RocksDB keeps state on disk, so in theory it should not consume memory indefinitely and cause an OOM. You could check for object leaks by analyzing a jmap heap dump of the TaskManager after the failover.
3. Alternatively, trigger a savepoint first and then resume the job from that savepoint to see whether the OOM still occurs; if it does not, the problem is likely related to your application code.

Best,
Guowei

On Sat, Sep 11, 2021 at 2:01 AM Kevin Lam <kevin....@shopify.com> wrote:

Hi all,

We've seen scenarios where TaskManagers begin to OOM shortly after a job restores from a checkpoint. Our Flink app has very large state (100s of GB) and we use RocksDB as the state backend.

Our repro is roughly: run the job for an hour to let it accumulate state, then kill a TaskManager. The job restores properly, but minutes later TaskManagers begin to be OOM-killed on Kubernetes, which leads to a degenerate loop where the job restores, new OOMs trigger another restore, and it never recovers. We've tried doubling the TaskManager memory, and observed that OOMs still happen even when the allocated Kubernetes container memory is not maxed out.

Can you shed some light on what happens during the restore process? How are checkpoints loaded, and how does this affect the memory pressure of TaskManagers (that, e.g., had a task running, had it cancelled, and were re-assigned a new task as part of the restore)?

Any help is appreciated!
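Guowei's two diagnostic suggestions above can be sketched as shell commands. This is a hedged sketch: `<tm-pid>`, `<job-id>`, the savepoint path, and the job jar are hypothetical placeholders, and the commands assume a JDK on the TaskManager host and the standard Flink CLI:

```shell
# 1) After a failover, dump the TaskManager's live JVM heap for leak analysis
#    (inspect the .hprof file with a tool such as Eclipse MAT).
#    <tm-pid> is the TaskManager JVM's process id.
jmap -dump:live,format=b,file=/tmp/taskmanager.hprof <tm-pid>

# 2) Trigger a savepoint, then resume the job from it instead of the
#    checkpoint, to see whether the OOM still reproduces.
flink savepoint <job-id> s3://my-bucket/savepoints/   # hypothetical target dir
flink run -s s3://my-bucket/savepoints/savepoint-xxxx <job-jar>
```

If the savepoint-based restore does not OOM, that points toward the checkpoint restore path (or state accumulated in it) rather than steady-state application memory use.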