Hi,

If your YARN cluster uses the fair scheduler, you could check whether the yarn.scheduler.fair.assignmultiple<https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/FairScheduler.html> config is set. If it is, adjusting yarn.scheduler.fair.dynamic.max.assign and yarn.scheduler.fair.max.assign could help; see the sketch below. Also, AFAIK, Flink does not exert any extra control over how YARN places its applications on different nodes. A key difference between Flink and Spark is that most Flink jobs are unbounded while Spark jobs are bounded, so even under the same YARN scheduling strategy, the final distribution of applications can look different after some time.
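A rough sketch of what that could look like in yarn-site.xml (the values are purely illustrative, not a recommendation; please tune them for your cluster based on the FairScheduler docs):

<property>
  <name>yarn.scheduler.fair.assignmultiple</name>
  <value>true</value> <!-- when true, a single node heartbeat can be assigned several containers -->
</property>
<property>
  <name>yarn.scheduler.fair.dynamic.max.assign</name>
  <value>false</value> <!-- disable dynamic batching so that max.assign below takes effect -->
</property>
<property>
  <name>yarn.scheduler.fair.max.assign</name>
  <value>2</value> <!-- example cap on containers assigned per heartbeat; tends to spread TMs over more nodes -->
</property>

With a small cap like this, the scheduler hands out at most that many containers per node heartbeat instead of filling up one node, so TaskManagers should land on more nodes.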
Best,
Biao Geng

From: Lu Niu <qqib...@gmail.com>
Date: Thursday, September 7, 2023 at 12:17 AM
To: Geng Biao <biaoge...@gmail.com>
Cc: Chen Zhanghao <zhanghao.c...@outlook.com>, Weihua Hu <huweihua....@gmail.com>, Kenan Kılıçtepe <kkilict...@gmail.com>, user <user@flink.apache.org>
Subject: Re: Uneven TM Distribution of Flink on YARN

Hi,

Thanks for all your help. Are there any other insights?

Best
Lu

On Wed, Aug 30, 2023 at 11:29 AM Lu Niu <qqib...@gmail.com> wrote:

No, we don't use yarn.taskmanager.node-label.

Best
Lu

On Tue, Aug 29, 2023 at 12:17 AM Geng Biao <biaoge...@gmail.com> wrote:

Maybe you can check if you have set yarn.taskmanager.node-label for some Flink jobs?

Best,
Biao Geng

Sent from Outlook for iOS<https://aka.ms/o0ukef>
________________________________
From: Chen Zhanghao <zhanghao.c...@outlook.com>
Sent: Tuesday, August 29, 2023 12:14:53 PM
To: Lu Niu <qqib...@gmail.com>; Weihua Hu <huweihua....@gmail.com>
Cc: Kenan Kılıçtepe <kkilict...@gmail.com>; user <user@flink.apache.org>
Subject: Re: Uneven TM Distribution of Flink on YARN

CCing @Weihua Hu, who is an expert on this. Do you have any ideas on the phenomenon here?

Best,
Zhanghao Chen
________________________________
From: Lu Niu <qqib...@gmail.com>
Sent: Tuesday, August 29, 2023 12:11:35 PM
To: Chen Zhanghao <zhanghao.c...@outlook.com>
Cc: Kenan Kılıçtepe <kkilict...@gmail.com>; user <user@flink.apache.org>
Subject: Re: Uneven TM Distribution of Flink on YARN

Thanks for your reply. The interesting fact is that we also manage Spark on YARN; however, only the Flink cluster is having the issue. I am wondering whether there is a difference in the implementation on the Flink side.

Best
Lu

On Mon, Aug 28, 2023 at 8:38 PM Chen Zhanghao <zhanghao.c...@outlook.com> wrote:

Hi Lu Niu,

TM distribution on YARN nodes is managed by the YARN RM and is out of the scope of Flink. On the other hand, cluster.evenly-spread-out-slots forces even distribution of tasks among Flink TMs and has nothing to do with your concern. Also, that config currently only supports Standalone-mode Flink clusters and does not take effect on a Flink cluster on YARN.

Best,
Zhanghao Chen
________________________________
From: Lu Niu <qqib...@gmail.com>
Sent: August 29, 2023 4:30
To: Kenan Kılıçtepe <kkilict...@gmail.com>
Cc: user <user@flink.apache.org>
Subject: Re: Uneven TM Distribution of Flink on YARN

Thanks for the reply. We've already set cluster.evenly-spread-out-slots = true.

Best
Lu

On Mon, Aug 28, 2023 at 1:23 PM Kenan Kılıçtepe <kkilict...@gmail.com> wrote:

Have you checked the config param cluster.evenly-spread-out-slots?

On Mon, Aug 28, 2023 at 10:31 PM Lu Niu <qqib...@gmail.com> wrote:

Hi, Flink users

We have recently observed that the allocation of Flink TaskManagers in our YARN cluster is not evenly distributed. We would like to hear your thoughts on this matter.

1. Our setup includes Flink version 1.15.1 and Hadoop 2.10.0.
2. The uneven distribution looks like this: out of a 370-node YARN cluster, 16 nodes have either 0 or 1 vCores available, while 110 nodes have more than 10 vCores available.

Is such behavior expected? If not, does Flink provide a fix?

Thanks!

Best
Lu