Hi all,

We've seen scenarios where TaskManagers begin to OOM shortly after a
job restores from a checkpoint. Our Flink app has very large state
(hundreds of GB) and we use RocksDB as the state backend.
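
For context, the backend is configured roughly along these lines in
flink-conf.yaml (standard Flink options; the values are illustrative
rather than our exact settings):

  state.backend: rocksdb
  # incremental checkpoints (only new SST files are uploaded per checkpoint)
  state.backend.incremental: true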

Our repro is roughly: run the job for an hour so it accumulates state,
then kill a TaskManager. The job restores correctly, but minutes later
TaskManagers start getting OOM-killed on Kubernetes. This puts the job
in a degenerate loop: it restores, new OOMs trigger another restore,
and it never recovers.

We've tried doubling the TaskManager memory, and observed that OOM
kills still happen even when the allocated Kubernetes container memory
is not maxed out.
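
For reference, the memory-related settings we've been adjusting look
roughly like this (illustrative values, not our real numbers):

  taskmanager.memory.process.size: 16g        # doubled (illustrative value)
  taskmanager.memory.managed.fraction: 0.4    # Flink default; managed memory backs RocksDB when memory control is on
  state.backend.rocksdb.memory.managed: true  # default since Flink 1.10: bound RocksDB block cache + write buffers by managed memory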

Can you shed some light on what happens during the restore process? How
are checkpoints loaded, and how does this affect memory pressure on
TaskManagers (e.g. ones that had a task running, had it cancelled, and
were then assigned a new task as part of the restore)?

Any help is appreciated!
