I'm not very knowledgeable when it comes to Linux memory management, but do note that Linux (and by extension Kubernetes) counts page cache generated by disk IO toward a cgroup's memory usage when deciding whether a process has exceeded its limit; see e.g. https://faun.pub/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d
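As a concrete illustration of that accounting, the commands below are a diagnostic sketch, assuming a cgroup v1 memory controller mounted at the conventional path inside the container (cgroup v2 exposes a similar `memory.stat`); they show how page cache counts toward the container's limit alongside RSS:

```shell
# Inside the TaskManager container: inspect the memory cgroup's accounting.
# "cache" is page cache populated by disk IO (e.g. RocksDB SST file reads);
# the kernel charges it to the cgroup, and if reclaim cannot keep total usage
# under the limit, the OOM killer fires even though heap/RSS alone looks fine.
grep -E '^(cache|rss) ' /sys/fs/cgroup/memory/memory.stat
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
```

Comparing `memory.usage_in_bytes` against the JVM's own metrics indicates how much of the gap is page cache rather than heap.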
Regards,
Alexis

________________________________
From: Guowei Ma <guowei....@gmail.com>
Sent: Monday, September 13, 2021 8:35 AM
To: Kevin Lam <kevin....@shopify.com>
Cc: user <user@flink.apache.org>
Subject: Re: TaskManagers OOM'ing for Flink App with very large state only when restoring from checkpoint

Hi, Kevin

1. Could you share some specifics, such as which version of Flink you are using, and whether the job uses the DataStream API or SQL?
2. As far as I know, RocksDB keeps state on disk, so in theory it should not consume memory indefinitely and cause an OOM. You could check for object leaks by analyzing a jmap heap dump of the TaskManager after the failover.
3. Alternatively, trigger a savepoint first and then resume the job from that savepoint to see whether the OOM still occurs; if it does not, the problem is likely related to your application code.

Best,
Guowei

On Sat, Sep 11, 2021 at 2:01 AM Kevin Lam <kevin....@shopify.com> wrote:

Hi all,

We've seen scenarios where TaskManagers begin to OOM shortly after a job restores from a checkpoint. Our Flink app has very large state (100s of GB) and we use RocksDB as the state backend.

Our repro is roughly: run the job for an hour to let it accumulate state, then kill a TaskManager. The job restores properly, but minutes later TaskManagers begin to be OOM-killed on Kubernetes, which leads to a degenerate loop where the job restores, new OOMs trigger another restore, and it never recovers. We've tried doubling the TaskManager memory, and observed that OOMs still happen even when the allocated Kubernetes container memory is not maxed out.

Can you shed some light on what happens during the restore process? How are checkpoints loaded, and how does this affect the memory pressure of TaskManagers (that, e.g., had a task running, had it cancelled, and were re-assigned a new task as part of the restore)?

Any help is appreciated!
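Guowei's two diagnostic suggestions above can be sketched as shell commands. This is a hedged sketch: `<tm-pid>`, `<job-id>`, the savepoint path, and the job jar are hypothetical placeholders, and the commands assume a JDK on the TaskManager host and the standard Flink CLI:

```shell
# 1) After a failover, dump the TaskManager's live JVM heap for leak analysis
#    (inspect the .hprof file with a tool such as Eclipse MAT).
#    <tm-pid> is the TaskManager JVM's process id.
jmap -dump:live,format=b,file=/tmp/taskmanager.hprof <tm-pid>

# 2) Trigger a savepoint, then resume the job from it instead of the
#    checkpoint, to see whether the OOM still reproduces.
flink savepoint <job-id> s3://my-bucket/savepoints/   # hypothetical target dir
flink run -s s3://my-bucket/savepoints/savepoint-xxxx <job-jar>
```

If the savepoint-based restore does not OOM, that points toward the checkpoint restore path (or state accumulated in it) rather than steady-state application memory use.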