Thanks for your replies Alexis and Guowei. We're using Flink 1.13.1 with the DataStream API.

I'll try the savepoint and take a look at that IO article, thank you. Please let me know if anything else comes to mind!
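For reference, here's roughly what I'm planning to run for the savepoint experiment (the job id, target directory, and jar below are placeholders, not our real values):

    # trigger a savepoint for the running job
    flink savepoint <job-id> <target-directory>

    # cancel the job, then resume it from the savepoint that was written
    flink cancel <job-id>
    flink run -d -s <target-directory>/<savepoint-dir> <job-jar>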
On Mon, Sep 13, 2021 at 3:05 AM Alexis Sarda-Espinosa <alexis.sarda-espin...@microfocus.com> wrote:

> I'm not very knowledgeable when it comes to Linux memory management, but
> do note that Linux (and by extension Kubernetes) takes disk IO into
> account when deciding whether a process is using more memory than it's
> allowed to, see e.g.
> https://faun.pub/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d
>
> Regards,
> Alexis.
>
> ------------------------------
> From: Guowei Ma <guowei....@gmail.com>
> Sent: Monday, September 13, 2021 8:35 AM
> To: Kevin Lam <kevin....@shopify.com>
> Cc: user <user@flink.apache.org>
> Subject: Re: TaskManagers OOM'ing for Flink App with very large state
> only when restoring from checkpoint
>
> Hi Kevin,
>
> 1. Could you give me some specific information, such as what version of
> Flink you are using, and whether it uses DataStream or SQL?
> 2. As far as I know, RocksDB puts state on disk, so in theory it should
> not keep consuming memory and cause an OOM. You could check for object
> leaks by analyzing a jmap of the TaskManager after the failover.
> 3. Another option: trigger a savepoint first, then resume the job from
> that savepoint and see if the OOM still occurs; if not, it is likely
> related to your application code.
>
> Best,
> Guowei
>
>
> On Sat, Sep 11, 2021 at 2:01 AM Kevin Lam <kevin....@shopify.com> wrote:
>
> Hi all,
>
> We've seen scenarios where TaskManagers begin to OOM shortly after a job
> restore from checkpoint. Our Flink app has very large state (100s of GB)
> and we use RocksDB as the state backend.
>
> Our repro is something like this: run the job for an hour and let it
> accumulate state, then kill a TaskManager. The job restores properly, but
> minutes later TaskManagers begin to be killed on K8S due to OOM. This
> causes a degenerate state where the job restores, new OOMs force another
> restore, and it never recovers.
>
> We've tried increasing the TaskManager memory (doubled it), and observed
> that OOMs still happen even when the allocated K8S container memory is
> not maxed out.
>
> Can you shed some light on what happens during a restore? How are
> checkpoints loaded, and how does this affect the memory pressure of
> TaskManagers (that, e.g., had a task running, got it cancelled, and were
> re-assigned a new task as part of the restore)?
>
> Any help is appreciated!
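PS: for the jmap suggestion, I'm assuming something along these lines, run against the TaskManager JVM on the pod (the PID is a placeholder). One caveat is that RocksDB memory is native/off-heap, so a heap histogram or dump only covers the JVM-managed portion:

    # class histogram of live objects in the TaskManager JVM
    jmap -histo:live <tm-pid> | head -n 40

    # or a full heap dump for offline analysis
    jmap -dump:live,format=b,file=/tmp/tm-heap.hprof <tm-pid>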