Hi, Kevin

1. Could you give me some more specific information, such as which version
of Flink you are using, and whether the job uses the DataStream API or SQL?
2. As far as I know, RocksDB keeps state on disk, so in theory it should not
keep consuming memory and cause an OOM.
    So you could check whether there is an object leak by analyzing a jmap
heap dump of the TaskManager after the failover (see the example commands after this list).
3. Another option: you could trigger a savepoint first and then resume the
job from that savepoint to see whether the OOM still occurs;
     if it does not, the problem is likely related to your application code
(example commands are also shown below).
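
For point 2, a rough sketch of the jmap inspection (the TaskManager pid and
output path are placeholders you would adapt to your deployment; note that
jmap only covers the JVM heap, not RocksDB's native memory):

    # class histogram of live objects, largest classes first
    jmap -histo:live <taskmanager-pid> | head -n 30
    # full heap dump for offline analysis (e.g. with Eclipse MAT)
    jmap -dump:live,format=b,file=/tmp/tm-heap.hprof <taskmanager-pid>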
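
For point 3, something like the following with the Flink CLI (the job id,
savepoint directory, and jar path are placeholders; a savepoint target
directory must be configured or passed explicitly):

    # trigger a savepoint for the running job
    ./bin/flink savepoint <jobId> <targetDirectory>
    # resubmit the job, restoring from that savepoint
    ./bin/flink run -s <savepointPath> <your-job-jar> [job args]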

Best,
Guowei


On Sat, Sep 11, 2021 at 2:01 AM Kevin Lam <kevin....@shopify.com> wrote:

> Hi all,
>
> We've seen scenarios where TaskManagers will begin to OOM, shortly after a
> job restore from checkpoint. Our Flink app has very large state (100s of
> GB) and we use RocksDB as a backend.
>
> Our repro is something like this: run the job for an hour and let it
> accumulate state, kill a task manager. The job restores properly, but then
> minutes later task managers begin to be killed on K8S due to OOM, and this
> causes a degenerate state where the job restores and new OOMs cause the job
> to restore again and it never recovers.
>
> We've tried increasing the TaskManager memory (doubled), and observed that
> OOMs still happen even when the allocated k8s container memory is not maxed
> out.
>
> Can you shed some light on what happens during a restore process? How are
> checkpoints loaded, and how does this affect the memory pressure of task
> managers (that have, e.g., had a task running, had it cancelled, and been
> re-assigned a new task as part of restore)?
>
> Any help is appreciated!
>
>