Hi James,

What version of Flink are you running? In 1.5.0, tasks can spread out across
TaskManagers due to changes that were introduced to support "local recovery"
[1]. There is a mitigation in 1.5.1 that prevents tasks from spreading out,
but local recovery must be disabled [2].
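
For reference, local recovery can be disabled explicitly in flink-conf.yaml
(assuming the 1.5.x option name; it should be off by default):

    state.backend.local-recovery: false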

Best,
Gary

[1] https://issues.apache.org/jira/browse/FLINK-9635
[2] https://issues.apache.org/jira/browse/FLINK-9634

On Mon, Sep 3, 2018 at 9:20 AM, James (Jian Wu) [FDS Data Platform] <
james...@coupang.com> wrote:

> Hi:
>
>
>
> I launch a Flink application on YARN with 5 task managers, each task
> manager with 3 slots, using the following script:
>
>
>
> #!/bin/sh
>
> CLASSNAME=$1
>
> JARNAME=$2
>
> ARGUMENTS=$3
>
>
>
> export JVM_ARGS="${JVM_ARGS} -Dmill.env.active=aws"
>
> /usr/bin/flink run -m yarn-cluster --parallelism 15  -yn 5 -ys 3 -yjm 8192
> -ytm 8192  -ynm flink-order-detection -yD 
> env.java.opts.jobmanager='-Dmill.env.active=aws'
> -yD env.java.opts.taskmanager='-Dmill.env.active=aws'  -c $CLASSNAME   \
>
> $JARNAME $ARGUMENTS
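>
> For example, the script would be invoked like this (the script, class, and
> jar names below are just placeholders):
>
> ./submit-flink-job.sh com.example.OrderDetectionJob order-detection.jar "--env aws"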
>
>
>
>
>
> The original Flink app occupied 5 containers and 15 vcores and ran for 3+
> days. Then one of the task managers was killed by YARN because of a memory
> leak, and the job manager started new task managers. Currently my Flink app
> is running normally on YARN, but it occupies 10 containers and 28 vcores.
> (The Application Master shows my Flink job running for 75 hours; clicking
> into the running job in the Flink web UI shows it running for 28 hours
> because of the restart.)
>
>
>
> In my opinion, the job manager should just restart the failed task manager,
> so the app should still end up using 5 containers and 15 vcores. Why does
> the job occupy double the resources after being restarted by YARN?
>
>
>
> Can anyone give me some suggestions?
>
>
>
> Regards
>
>
>
> James
>
