Hi,

I launch a Flink application on YARN with 5 task managers, each with 3 slots, using the following script:
#!/bin/sh
CLASSNAME=$1
JARNAME=$2
ARGUMENTS=$3
export JVM_ARGS="${JVM_ARGS} -Dmill.env.active=aws"
/usr/bin/flink run -m yarn-cluster --parallelism 15 \
  -yn 5 -ys 3 -yjm 8192 -ytm 8192 \
  -ynm flink-order-detection \
  -yD env.java.opts.jobmanager='-Dmill.env.active=aws' \
  -yD env.java.opts.taskmanager='-Dmill.env.active=aws' \
  -c $CLASSNAME $JARNAME $ARGUMENTS

Originally, the Flink app occupied 5 containers and 15 vcores. After running for 3+ days, one of the task managers was killed by YARN because of a memory leak, and the job manager started new task managers. The app is now running normally on YARN again, but it occupies 10 containers and 28 vcores. (The ApplicationMaster shows my Flink job has been running for 75 hours, but clicking into the running job in the Flink web UI shows it has been running for only 28 hours, because of the restart.)

In my opinion, the job manager should only restart the failed task manager, so the app should end up using 5 containers and 15 vcores as before. Why does the job occupy double the resources after YARN restarts it?

Can anyone give me some suggestions?

Regards,
James
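For context, the restart behaviour I expected corresponds to Flink's fixed-delay restart strategy in flink-conf.yaml. The key names below are Flink's standard restart-strategy options; the attempt count and delay values are illustrative only, not necessarily my actual settings:

```
# Sketch of a fixed-delay restart strategy in flink-conf.yaml.
# On task failure the job is restarted up to the given number of times,
# reusing task manager slots rather than requesting extra containers.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3   # illustrative value
restart-strategy.fixed-delay.delay: 10 s   # illustrative value
```

My expectation was that, with a strategy like this, the failed task manager is replaced and the total container/vcore count returns to its original level once the restart completes.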