Hi:

  I launch a Flink application on YARN with 5 task managers, each with 3
slots, using this script:

#!/bin/sh
CLASSNAME=$1
JARNAME=$2
ARGUMENTS=$3

export JVM_ARGS="${JVM_ARGS} -Dmill.env.active=aws"
/usr/bin/flink run -m yarn-cluster --parallelism 15 -yn 5 -ys 3 \
  -yjm 8192 -ytm 8192 -ynm flink-order-detection \
  -yD env.java.opts.jobmanager='-Dmill.env.active=aws' \
  -yD env.java.opts.taskmanager='-Dmill.env.active=aws' \
  -c "$CLASSNAME" "$JARNAME" "$ARGUMENTS"
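For reference, the steady-state footprint those flags should give can be
computed directly (a sketch; it assumes the legacy one-container-per-task-manager
YARN session model, plus one container for the job manager):

```shell
#!/bin/sh
# Expected YARN footprint for the submission above (sketch only;
# assumes one container per task manager plus one for the job manager).
YN=5                    # -yn: number of task managers
YS=3                    # -ys: slots per task manager
CONTAINERS=$((YN + 1))  # task manager containers + job manager container
SLOTS=$((YN * YS))      # total task slots (must cover --parallelism 15)
echo "containers=$CONTAINERS slots=$SLOTS"
```

By this count the job needs at most 6 containers and 15 slots, so the 10
containers seen after the restart are well above that expectation.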


The Flink app originally occupied 5 containers and 15 vcores. After running
for 3+ days, one task manager was killed by YARN because of a memory leak,
and the job manager started new task managers. The app is now running
normally on YARN, but it occupies 10 containers and 28 vcores. (The YARN
Application Master shows my Flink job running for 75 hours; clicking into
the running job in the Flink web UI shows it running for 28 hours because
of the restart.)

In my opinion, the job manager should simply restart the failed task
manager, so the app would end up using 5 containers and 15 vcores again.
Why does the job occupy double the resources after YARN restarts it?

Can anyone give me some suggestions?

Regards

James
