Thanks for your replies Alexis and Guowei. We're using Flink 1.13.1 with the DataStream API.

I'll try the savepoint and take a look at that IO article, thank you. Please let me know if anything else comes to mind!
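For reference, here's roughly what I'm planning to run for the savepoint experiment (the job id, target directory, and jar below are placeholders, not our real values):

    # trigger a savepoint for the running job
    flink savepoint <job-id> <target-directory>

    # cancel the job, then resume it from the savepoint that was written
    flink cancel <job-id>
    flink run -d -s <target-directory>/<savepoint-dir> <job-jar>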
On Mon, Sep 13, 2021 at 3:05 AM Alexis Sarda-Espinosa <alexis.sarda-espin...@microfocus.com> wrote:

> I'm not very knowledgeable when it comes to Linux memory management, but
> do note that Linux (and by extension Kubernetes) takes disk IO into
> account when deciding whether a process is using more memory than it's
> allowed to, see e.g.
> https://faun.pub/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d
>
> Regards,
> Alexis.
>
> ------------------------------
> From: Guowei Ma <guowei....@gmail.com>
> Sent: Monday, September 13, 2021 8:35 AM
> To: Kevin Lam <kevin....@shopify.com>
> Cc: user <user@flink.apache.org>
> Subject: Re: TaskManagers OOM'ing for Flink App with very large state
> only when restoring from checkpoint
>
> Hi Kevin,
>
> 1. Could you give me some specific information, such as what version of
> Flink you are using, and whether it uses DataStream or SQL?
> 2. As far as I know, RocksDB puts state on disk, so in theory it should
> not keep consuming memory and cause an OOM. You could check for object
> leaks by analyzing a jmap of the TaskManager after the failover.
> 3. Another option: trigger a savepoint first, then resume the job from
> that savepoint and see if the OOM still occurs; if not, it is likely
> related to your application code.
>
> Best,
> Guowei
>
>
> On Sat, Sep 11, 2021 at 2:01 AM Kevin Lam <kevin....@shopify.com> wrote:
>
> Hi all,
>
> We've seen scenarios where TaskManagers begin to OOM shortly after a job
> restore from checkpoint. Our Flink app has very large state (100s of GB)
> and we use RocksDB as the state backend.
>
> Our repro is something like this: run the job for an hour and let it
> accumulate state, then kill a TaskManager. The job restores properly, but
> minutes later TaskManagers begin to be killed on K8S due to OOM. This
> causes a degenerate state where the job restores, new OOMs force another
> restore, and it never recovers.
>
> We've tried increasing the TaskManager memory (doubled it), and observed
> that OOMs still happen even when the allocated K8S container memory is
> not maxed out.
>
> Can you shed some light on what happens during a restore? How are
> checkpoints loaded, and how does this affect the memory pressure of
> TaskManagers (that, e.g., had a task running, got it cancelled, and were
> re-assigned a new task as part of the restore)?
>
> Any help is appreciated!
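PS: for the jmap suggestion, I'm assuming something along these lines, run against the TaskManager JVM on the pod (the PID is a placeholder). One caveat is that RocksDB memory is native/off-heap, so a heap histogram or dump only covers the JVM-managed portion:

    # class histogram of live objects in the TaskManager JVM
    jmap -histo:live <tm-pid> | head -n 40

    # or a full heap dump for offline analysis
    jmap -dump:live,format=b,file=/tmp/tm-heap.hprof <tm-pid>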