Hi:
I launch a Flink application on YARN with 5 task managers, each with 3 slots
(-ys 3), using the following script:
#!/bin/sh
# Usage: <script> <main-class> <jar> <program-arguments>
CLASSNAME=$1
JARNAME=$2
ARGUMENTS=$3

export JVM_ARGS="${JVM_ARGS} -Dmill.env.active=aws"

# 5 TaskManagers (-yn 5) x 3 slots (-ys 3) = parallelism 15,
# 8 GB each for the JobManager (-yjm) and the TaskManagers (-ytm)
/usr/bin/flink run -m yarn-cluster --parallelism 15 \
  -yn 5 -ys 3 -yjm 8192 -ytm 8192 -ynm flink-order-detection \
  -yD env.java.opts.jobmanager='-Dmill.env.active=aws' \
  -yD env.java.opts.taskmanager='-Dmill.env.active=aws' \
  -c $CLASSNAME $JARNAME $ARGUMENTS
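(For reference, this is roughly how I confirm what the cluster registered after
launch; the JobManager host and port below are placeholders for my web UI
address:)

# assumption: <jobmanager-host>:8081 is the Flink web UI / REST endpoint
curl http://<jobmanager-host>:8081/overview       # total and available slot counts
curl http://<jobmanager-host>:8081/taskmanagers   # registered TaskManagers and their slots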
The original Flink app occupied 5 containers and 15 vcores. After running for
3+ days, one of the task managers was killed by YARN because of a memory leak,
and the job manager started new task managers. Currently my Flink app is
running normally on YARN, but it occupies 10 containers and 28 vcores. (The
Application Master shows my Flink job has been running for 75 hours; clicking
into the running job in the Flink web UI shows it running for 28 hours because
of the restart.)
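To see where the extra containers are coming from, I have been checking YARN
like this (the application id and attempt id are placeholders):

# list running applications to find my application id
yarn application -list -appStates RUNNING

# a restart of the whole application shows up as a new attempt
yarn applicationattempt -list <application-id>

# list the containers currently held by an attempt
yarn container -list <application-attempt-id>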
In my opinion, the job manager should just restart the failed task manager, so
the app should end up using 5 containers and 15 vcores again. Why does the job
occupy roughly double the resources after YARN restarts it?
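For clarity, these are the numbers I am comparing (the per-container split is
my own reading of the launch flags):

expected:       5 containers, 15 vcores   (5 TMs x 3 slots = parallelism 15)
after restart: 10 containers, 28 vcores   (roughly double)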
Can anyone give me some suggestions?
Regards
James