Hello,

We run a Flink 1.13.5 job in application mode on Kubernetes, with 1 JM and 1 TM. We also have a Kubernetes cron job that takes a savepoint every 2 hours (14 */2 * * *). Once in a while (about once every 2 days) the TM is OOMKilled. Suspiciously, it happens on even hours, ~4 minutes after the savepoint starts (e.g. 12:18, 4:18), but I don't see any failed savepoints, so I assume the OOM happens right after the savepoint is taken. However, the OOMKill doesn't happen on every savepoint, so this may be a coincidence.

I've reserved 2G for JVM overhead, but somehow it is not enough? Are there any known issues with memory and savepoints? Any suggestions on how to troubleshoot this?
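To make the timing concrete, here is a minimal Python sketch of how the gap between a kill and the preceding savepoint trigger can be computed. It hardcodes the schedule 14 */2 * * * (minute 14 of every even hour) rather than parsing cron syntax. The helper name last_savepoint_start and the calendar date are made up for illustration, and it assumes the kill timestamps and the cron job use the same time zone.

```python
from datetime import datetime, timedelta


def last_savepoint_start(t: datetime) -> datetime:
    """Most recent firing of cron '14 */2 * * *' at or before t.

    Hypothetical helper: the schedule is hardcoded, not parsed.
    """
    # Round the hour down to the nearest even hour and pin the minute to 14.
    candidate = t.replace(hour=t.hour - t.hour % 2, minute=14,
                          second=0, microsecond=0)
    # If that firing is still in the future, step back one 2-hour period.
    if candidate > t:
        candidate -= timedelta(hours=2)
    return candidate


# Kill times from the post; the date itself is an arbitrary placeholder.
for hour, minute in [(12, 18), (4, 18)]:
    killed_at = datetime(2022, 1, 1, hour, minute)
    gap = killed_at - last_savepoint_start(killed_at)
    print(f"{killed_at:%H:%M} is {gap} after the savepoint trigger")
```

Running the same calculation over all OOMKilled timestamps from the pod events would show whether the ~4 minute gap holds every time or only for the kills noticed so far.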
Final TaskExecutor Memory configuration:
  Total Process Memory:          10.000gb (10737418240 bytes)
    Total Flink Memory:          7.547gb (8103395328 bytes)
      Total JVM Heap Memory:     3.523gb (3783262149 bytes)
        Framework:               128.000mb (134217728 bytes)
        Task:                    3.398gb (3649044421 bytes)
      Total Off-heap Memory:     4.023gb (4320133179 bytes)
        Managed:                 3.019gb (3241358179 bytes)
        Total JVM Direct Memory: 1.005gb (1078775000 bytes)
          Framework:             128.000mb (134217728 bytes)
          Task:                  128.000mb (134217728 bytes)
          Network:               772.800mb (810339544 bytes)
    JVM Metaspace:               256.000mb (268435456 bytes)
    JVM Overhead:                2.203gb (2365587456 bytes)

Thanks,
Alexey
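
P.S. As a sanity check on the memory breakdown above, the following Python sketch re-adds the logged byte values to confirm that the components sum to the reported totals. The numbers are copied from the log; the variable names and grouping are only my reading of how the entries nest, not something Flink prints.

```python
# Byte values copied from the "Final TaskExecutor Memory configuration" log.
heap = {"framework": 134217728, "task": 3649044421}
direct = {"framework": 134217728, "task": 134217728, "network": 810339544}
managed = 3241358179
metaspace = 268435456
overhead = 2365587456

# Rebuild each total from its parts, following the nesting of the log.
total_heap = sum(heap.values())
total_direct = sum(direct.values())
total_off_heap = managed + total_direct
total_flink = total_heap + total_off_heap
total_process = total_flink + metaspace + overhead

# Compare against the totals that the log reports.
assert total_heap == 3783262149
assert total_direct == 1078775000
assert total_off_heap == 4320133179
assert total_flink == 8103395328
assert total_process == 10737418240  # exactly 10 GiB

# Share of the process budget that is not JVM heap.
non_heap = total_process - total_heap
print(f"non-heap share of process memory: {non_heap / total_process:.1%}")
```

The totals are internally consistent, so the configuration itself adds up to exactly 10 GiB. If the container limit is also 10 GiB (an assumption, the limit is not shown here), any native allocation beyond what these budgets account for would push the pod over the limit.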