Hi all,

We're debugging an issue with OOMs that occur on our jobs shortly after a restore from checkpoint. Our application runs on Kubernetes and uses RocksDB as its state backend.
We reproduced the issue on a small cluster of two task managers. If we kill a single task manager, we notice that after restoring from checkpoint, the untouched task manager has an elevated memory footprint (see the blue line for the surviving task manager): [image: image.png]

If we then kill the newest task manager (yellow line) again, the surviving task manager gets OOM killed after the restore. We looked at the OOM killer report, and the memory does not appear to be coming from the JVM, but we're unsure of the source. Something seems to be allocating native memory that the JVM is not aware of, and we're suspicious of RocksDB.

Has anyone seen this kind of issue before? Is it possible there's some kind of memory pressure or memory leak coming from RocksDB that only presents itself when a job is restarted? Perhaps something isn't cleaned up after the restore? Any help would be appreciated.
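In case it's useful context, below is a minimal sketch of the diagnostics we can turn on in flink-conf.yaml to check whether the native allocation really is RocksDB. The option names are from memory and assume a reasonably recent Flink release, so they may differ slightly by version:

    # Expose RocksDB's native memory counters (block cache, memtables,
    # table readers) as task manager metrics.
    state.backend.rocksdb.metrics.block-cache-usage: true
    state.backend.rocksdb.metrics.block-cache-pinned-usage: true
    state.backend.rocksdb.metrics.size-all-mem-tables: true
    state.backend.rocksdb.metrics.estimate-table-readers-mem: true

    # Keep RocksDB inside Flink's managed memory budget (this is the
    # default; listed only to make the assumption explicit).
    state.backend.rocksdb.memory.managed: true

    # Enable JVM Native Memory Tracking to rule out native memory the
    # JVM *does* know about (metaspace, threads, direct buffers, ...).
    env.java.opts.taskmanager: "-XX:NativeMemoryTracking=summary"

With NMT enabled, "jcmd <pid> VM.native_memory summary" on the task manager should show whether the growth is visible to the JVM at all; if it isn't, and the RocksDB metrics don't account for it either, that would point at untracked native allocations.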