Hi there,

We have been running Flink 1.11.2 in production with a fairly large setup. The job runs fine for a couple of days and then ends up in a restart loop caused by YARN killing containers for exceeding their memory limit. We did not observe this when running against 1.9.1 with the same settings. Here is the JVM environment passed to both the 1.11 and 1.9.1 jobs:
env.java.opts.taskmanager: '-XX:+UseG1GC -XX:MaxGCPauseMillis=500 -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5 -XX:InitiatingHeapOccupancyPercent=45 -XX:NewRatio=1 -XX:+PrintClassHistogram -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime -Xloggc:<LOG_DIR>/gc.log'

env.java.opts.jobmanager: '-XX:+UseG1GC -XX:MaxGCPauseMillis=500 -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5 -XX:InitiatingHeapOccupancyPercent=45 -XX:NewRatio=1 -XX:+PrintClassHistogram -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime -Xloggc:<LOG_DIR>/gc.log'

After an initial investigation, this does not appear to be related to JVM heap usage or to a GC issue. However, we observed that JVM non-heap usage on some containers keeps rising while the job falls into the restart loop, as shown below.

[image: chart of JVM non-heap usage rising on the affected containers]

From a configuration perspective, we would like to understand how the task manager handles class loading (and unloading?) when we set include-user-jar to first. Are there any suggestions on how we can better understand how the new memory model introduced in 1.10 affects this issue?

Other relevant settings:

cluster.evenly-spread-out-slots: true
zookeeper.sasl.disable: true
yarn.per-job-cluster.include-user-jar: first
yarn.properties-file.location: /usr/local/hadoop/etc/hadoop/

Thanks,
Chen
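P.S. In case it helps frame the memory-model question: our (possibly wrong) understanding is that in the 1.10+ memory model the non-heap areas are budgeted by taskmanager.memory.jvm-metaspace.size (which Flink translates into -XX:MaxMetaspaceSize on the task manager JVM) and the taskmanager.memory.jvm-overhead.* options, all inside taskmanager.memory.process.size, which is what the YARN container limit is derived from. The values below are only an illustration of the knobs we are considering tuning, not our production sizes:

taskmanager.memory.process.size: 16g           # total process budget the YARN container is sized from
taskmanager.memory.jvm-metaspace.size: 512m    # becomes -XX:MaxMetaspaceSize on the TM JVM
taskmanager.memory.jvm-overhead.min: 512m      # native/non-heap headroom outside heap and metaspace
taskmanager.memory.jvm-overhead.max: 2g
taskmanager.memory.jvm-overhead.fraction: 0.1

If the growth we see is metaspace (e.g. classes that never get unloaded), we assume a tighter metaspace cap would at least turn the YARN kill into an OutOfMemoryError: Metaspace that is visible in the task manager logs; please correct us if that reasoning is off.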
