Hi all,

We've seen scenarios where TaskManagers begin to OOM shortly after a job restores from a checkpoint. Our Flink app has very large state (hundreds of GB), and we use RocksDB as the state backend.
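For context, our setup looks roughly like the sketch below (simplified; the checkpoint interval and storage path are illustrative, not our exact values):

    import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class JobSetup {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // RocksDB state backend with incremental checkpoints enabled,
            // so state lives on local disk and only deltas are uploaded.
            env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

            // Checkpoint every 60s to durable storage (path illustrative).
            env.enableCheckpointing(60_000L);
            env.getCheckpointConfig().setCheckpointStorage("s3://<bucket>/checkpoints");

            // ... pipeline definition elided ...
            env.execute("large-state-job");
        }
    }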
Our repro is roughly: run the job for about an hour so it accumulates state, then kill a TaskManager. The job restores properly, but minutes later TaskManagers begin to be OOM-killed on K8s, which puts the job in a degenerate loop: it restores, new OOMs force another restore, and it never recovers. We've tried doubling the TaskManager memory and observed that OOMs still happen even when the allocated K8s container memory is not maxed out.

Can you shed some light on what happens during the restore process? How are checkpoints loaded, and how does that affect memory pressure on TaskManagers (e.g. one that had a task running, had it cancelled, and was assigned a new task as part of the restore)? Any help is appreciated!
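For reference, the memory-related settings we're running look roughly like this (values are illustrative, not our exact config):

    # flink-conf.yaml (illustrative values)
    taskmanager.memory.process.size: 16g        # total container budget, doubled from 8g
    taskmanager.memory.managed.fraction: 0.4    # share of Flink memory reserved as managed memory
    state.backend.rocksdb.memory.managed: true  # bound RocksDB block cache/memtables by managed memory (the default)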