Hi James,

What version of Flink are you running? In 1.5.0, tasks can spread out due to changes that were introduced to support "local recovery" [1]. There is a mitigation in 1.5.1 that prevents tasks from spreading out, but local recovery must be disabled for it to take effect [2].
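As a minimal sketch (assuming the 1.5.x option name state.backend.local-recovery, which is disabled by default), you can make the setting explicit as a dynamic property when submitting to YARN, e.g. added to a submission like the one quoted below:

    # Sketch: same submission flags as in the quoted script, with local recovery
    # explicitly disabled so the 1.5.1 mitigation can take effect.
    /usr/bin/flink run -m yarn-cluster --parallelism 15 -yn 5 -ys 3 -yjm 8192 -ytm 8192 \
      -ynm flink-order-detection \
      -yD state.backend.local-recovery=false \
      -c $CLASSNAME $JARNAME $ARGUMENTS

Please double-check the exact option name against the documentation for your Flink version.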
Best,
Gary

[1] https://issues.apache.org/jira/browse/FLINK-9635
[2] https://issues.apache.org/jira/browse/FLINK-9634

On Mon, Sep 3, 2018 at 9:20 AM, James (Jian Wu) [FDS Data Platform] <james...@coupang.com> wrote:

> Hi:
>
> I launch a Flink application on YARN with 5 task managers, each with 3 slots, using the following script:
>
> #!/bin/sh
> CLASSNAME=$1
> JARNAME=$2
> ARGUMENTS=$3
>
> export JVM_ARGS="${JVM_ARGS} -Dmill.env.active=aws"
>
> /usr/bin/flink run -m yarn-cluster --parallelism 15 -yn 5 -ys 3 -yjm 8192 -ytm 8192 \
>   -ynm flink-order-detection \
>   -yD env.java.opts.jobmanager='-Dmill.env.active=aws' \
>   -yD env.java.opts.taskmanager='-Dmill.env.active=aws' \
>   -c $CLASSNAME $JARNAME $ARGUMENTS
>
> The Flink app originally occupied 5 containers and 15 vcores. After it ran for 3+ days, one of the task managers was killed by YARN because of a memory leak, and the job manager started new task managers. My Flink app is now running normally on YARN again, but it occupies 10 containers and 28 vcores. (The Application Master shows my Flink job running for 75 hours; clicking into the running job in the Flink web UI shows it running for 28 hours because of the restart.)
>
> In my opinion, the job manager should attempt to restart the failed task manager, so the app should end up using 5 containers and 15 vcores again. Why does the job occupy double the resources after being restarted by YARN?
>
> Can anyone give me some suggestions?
>
> Regards
>
> James