TM OOMKilled

Alexey Trenikhun Mon, 14 Feb 2022 13:17:48 -0800

Hello,
We run a Flink 1.13.5 job in application mode on Kubernetes, with 1 JM and 1 TM.
We also have a Kubernetes cron job that takes a savepoint every 2 hours
(14 */2 * * *). Once in a while (about once every 2 days) the TM is OOMKilled.
Suspiciously, it happens on even hours, ~4 minutes after the savepoint starts
(e.g. 12:18, 4:18), but I don't see any failed savepoints, so I assume the OOM
happens right after the savepoint is taken. However, the TM is not OOMKilled on
every savepoint, so this may be a coincidence.
I've reserved 2G for JVM overhead, but somehow that is not enough. Are there any
known issues with memory and savepoints? Any suggestions on how to troubleshoot
this?
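
To make the timing correlation concrete, here is a minimal Python sketch that lists the trigger times of the schedule 14 */2 * * * and measures how long after the latest trigger each kill happened. It is hand-rolled for this one cron expression only (not a general cron parser), the function names are hypothetical, and the date is arbitrary; only the two example times come from this message.

```python
from datetime import datetime, timedelta


def savepoint_times(day):
    """Trigger times of the cron expression `14 */2 * * *` on the given day."""
    # `*/2` in the hour field matches hours 0, 2, ..., 22; the minute is fixed at 14.
    return [
        day.replace(hour=h, minute=14, second=0, microsecond=0)
        for h in range(0, 24, 2)
    ]


def minutes_after_last_savepoint(event):
    """Minutes between an event and the most recent savepoint trigger before it."""
    # Include the previous day so events shortly after midnight are handled.
    candidates = savepoint_times(event) + savepoint_times(event - timedelta(days=1))
    previous = max(t for t in candidates if t <= event)
    return (event - previous).total_seconds() / 60


# The two example kill times from the message; the date is an arbitrary placeholder.
for hh, mm in [(12, 18), (4, 18)]:
    event = datetime(2022, 2, 14, hh, mm)
    print(event.strftime("%H:%M"), minutes_after_last_savepoint(event))
```

Running the same function over the actual kill timestamps from the pod events would show whether every OOMKill falls a few minutes after a trigger or only some of them.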


 Final TaskExecutor Memory configuration:
   Total Process Memory:          10.000gb (10737418240 bytes)
     Total Flink Memory:          7.547gb (8103395328 bytes)
       Total JVM Heap Memory:     3.523gb (3783262149 bytes)
         Framework:               128.000mb (134217728 bytes)
         Task:                    3.398gb (3649044421 bytes)
       Total Off-heap Memory:     4.023gb (4320133179 bytes)
         Managed:                 3.019gb (3241358179 bytes)
         Total JVM Direct Memory: 1.005gb (1078775000 bytes)
           Framework:             128.000mb (134217728 bytes)
           Task:                  128.000mb (134217728 bytes)
           Network:               772.800mb (810339544 bytes)
     JVM Metaspace:               256.000mb (268435456 bytes)
     JVM Overhead:                2.203gb (2365587456 bytes)
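
As a sanity check on the breakdown above, a small Python sketch that confirms each parent entry equals the sum of its children and reports what share of the process memory is JVM overhead. The byte values are copied from the log; the variable names are my own, and the script only checks the arithmetic, it says nothing about actual usage at runtime.

```python
# Byte values copied from the "Final TaskExecutor Memory configuration" log.
total_process = 10737418240
total_flink = 8103395328
jvm_heap = 3783262149
framework_heap = 134217728
task_heap = 3649044421
off_heap = 4320133179
managed = 3241358179
direct = 1078775000
framework_off_heap = 134217728
task_off_heap = 134217728
network = 810339544
metaspace = 268435456
jvm_overhead = 2365587456

# Each parent line in the log should equal the sum of the lines nested under it.
checks = {
    "heap = framework + task": jvm_heap == framework_heap + task_heap,
    "direct = framework + task + network":
        direct == framework_off_heap + task_off_heap + network,
    "off-heap = managed + direct": off_heap == managed + direct,
    "flink = heap + off-heap": total_flink == jvm_heap + off_heap,
    "process = flink + metaspace + overhead":
        total_process == total_flink + metaspace + jvm_overhead,
}

for name, ok in checks.items():
    print(("OK   " if ok else "FAIL ") + name)

# Share of the total process memory that is set aside as JVM overhead.
overhead_share = jvm_overhead / total_process
print(f"JVM overhead share of process memory: {overhead_share:.1%}")
```

All components add up to the 10 GB process size, so the budget itself is consistent; whatever exceeds it is allocated outside these accounted pools.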

Thanks,
Alexey
