Hi Xintong Song,

Thanks for the response.
I’d thought that the slotSharingGroup(“name”) method call was only available for streaming jobs - as I’d noted below, I’m running a batch workflow. Or is there a way to get this to work? I see that support for slot sharing groups is in Flink’s runtime, rather than in the streaming API, but I don’t see how batch workflows can set this.

Also, slot sharing is about subtasks sharing the same slot. But if I have one slot per TM, I’ll want to be running 32 TMs/server, and then slot sharing groups shouldn’t have any impact, right?

Though using N TMs/server, each with one slot, might help improve the odds of subtasks for my particular operator running on separate servers (since, as per FLINK-12122, in 1.8 they tend to clump together on the same TM if it has multiple slots).

Thanks again,

— Ken

> On Jun 12, 2019, at 7:02 PM, Xintong Song <tonysong...@gmail.com> wrote:
> 
> Hi Ken,
> 
> There is a discussion in issue
> <https://issues.apache.org/jira/browse/FLINK-12122> about a feature related
> to your demand. It proposes spread tasks evenly across TMs. However, the
> feature is still in progress, and it spreads all tasks evenly instead of
> specific operators.
> 
> For the time being, I would suggest to have only one slot per TM, and use slot
> sharing group
> <https://ci.apache.org/projects/flink/flink-docs-release-1.8/concepts/runtime.html#task-slots-and-resources>
> to make sure tasks of the same job graph vertex do not goes into the same
> slot/TM.
> 
> Thank you~
> 
> Xintong Song
> 
> 
> On Thu, Jun 13, 2019 at 4:58 AM Ken Krugler <kkrugler_li...@transpac.com>
> wrote:
> 
>> Hi all,
>> 
>> I’m running a complex (batch) workflow that has a step where it trains
>> Fasttext models.
>> 
>> This is very CPU-intensive, to the point where it will use all available
>> processing power on a server.
>> 
>> The Flink configuration I’m using is one TaskManager per server, with N
>> slots == available cores.
>> 
>> So what I’d like to do is ensure that if I have N of these training
>> operators running in parallel on N TaskManagers, slot assignment happens
>> such that each TM has one such operator.
>> 
>> Unfortunately, what typically happens now is that most/all of these
>> operators get assigned to the same TM, which then struggles to stay alive
>> under that load.
>> 
>> I haven’t seen any solution to this, though I can imagine some helicopter
>> stunts that could work around the issue.
>> 
>> Any suggestions?
>> 
>> Thanks,
>> 
>> — Ken
>> 
>> PS - I took a look through the list of FLIPs
>> <https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals>,
>> and didn’t see anything that covered this. I imagine it would need to be
>> something like YARN’s support for per-node vCore capacity and per-task
>> vCore requirements, but on a per-TM/per-operator basis.

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra
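
For reference, the slot sharing group call Xintong mentioned looks roughly like this on the streaming side - a minimal sketch against the 1.8 DataStream API, where the class name, the sample elements, and the map function body are made-up placeholders standing in for the CPU-heavy training step:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SlotSharingSketch {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.fromElements("model-1", "model-2", "model-3")
                // Placeholder for the CPU-heavy training operator.
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String modelName) {
                        return modelName + "-trained";
                    }
                })
                // Put the operator in its own slot sharing group, so its subtasks
                // don't share slots with operators left in the "default" group.
                .slotSharingGroup("training")
                .name("fasttext-training")
                .print();

            env.execute("slot sharing sketch");
        }
    }

As discussed above, the batch (DataSet) side in 1.8 doesn't appear to expose an equivalent setter. The one-slot-per-TM setup Xintong suggests is just the standard flink-conf.yaml setting, combined with starting one TaskManager process per desired slot on each server:

    taskmanager.numberOfTaskSlots: 1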