Hi Ken,

There is a discussion in issue <https://issues.apache.org/jira/browse/FLINK-12122> about a feature related to what you need. It proposes spreading tasks evenly across TMs. However, that feature is still in progress, and it would spread all tasks evenly rather than only specific operators.

For the time being, I would suggest having only one slot per TM, and using slot sharing groups <https://ci.apache.org/projects/flink/flink-docs-release-1.8/concepts/runtime.html#task-slots-and-resources> to make sure parallel tasks of the same job graph vertex do not go into the same slot/TM.
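For the slots, you can set this in flink-conf.yaml so that slots map 1:1 to machines:

    taskmanager.numberOfTaskSlots: 1

And here is a minimal sketch of the job side (the class and names are made up, and it uses the DataStream API for illustration). Since subtasks of the same vertex never share a slot, and each TM has exactly one slot, each parallel trainer should land on a different TM:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class TrainingJobSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Hypothetical input, standing in for whatever feeds the trainer.
            DataStream<String> corpus = env.fromElements("doc-1", "doc-2", "doc-3");

            corpus
                    // Stand-in for the CPU-heavy Fasttext training step.
                    .map(new MapFunction<String, String>() {
                        @Override
                        public String map(String doc) {
                            return doc; // training would happen here
                        }
                    })
                    // Put the trainer in its own slot sharing group, so no
                    // other operator's task is co-located in its slot.
                    .slotSharingGroup("training")
                    .setParallelism(4) // == number of TaskManagers
                    .print();

            env.execute("training-sketch");
        }
    }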
Thank you~

Xintong Song

On Thu, Jun 13, 2019 at 4:58 AM Ken Krugler <kkrugler_li...@transpac.com> wrote:

> Hi all,
>
> I’m running a complex (batch) workflow that has a step where it trains
> Fasttext models.
>
> This is very CPU-intensive, to the point where it will use all available
> processing power on a server.
>
> The Flink configuration I’m using is one TaskManager per server, with N
> slots == available cores.
>
> So what I’d like to do is ensure that if I have N of these training
> operators running in parallel on N TaskManagers, slot assignment happens
> such that each TM has one such operator.
>
> Unfortunately, what typically happens now is that most/all of these
> operators get assigned to the same TM, which then struggles to stay alive
> under that load.
>
> I haven’t seen any solution to this, though I can imagine some helicopter
> stunts that could work around the issue.
>
> Any suggestions?
>
> Thanks,
>
> — Ken
>
> PS - I took a look through the list of FLIPs
> <https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals>,
> and didn’t see anything that covered this. I imagine it would need to be
> something like YARN’s support for per-node vCore capacity and per-task
> vCore requirements, but on a per-TM/per-operator basis.
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> Custom big data solutions & training
> Flink, Solr, Hadoop, Cascading & Cassandra