Hi Xintong Song,

Thanks for the response.
I’d thought that the slotSharingGroup(“name”) method call was only available for streaming jobs - as I’d noted below, I’m running a batch workflow. Or is there a way to get this to work? I see that support for slot sharing groups is in Flink’s runtime, rather than in the streaming API, but I don’t see how batch workflows can set this.

Also, slot sharing is about subtasks sharing the same slot. But if I have one slot per TM, I’ll want to be running 32 TMs/server, and then slot sharing groups shouldn’t have any impact, right?

Though using N TMs/server, each with one slot, might help improve the odds of subtasks for my particular operator running on separate servers (since, as per FLINK-12122, in 1.8 they tend to clump together on the same TM if it has multiple slots).

Thanks again,

— Ken

> On Jun 12, 2019, at 7:02 PM, Xintong Song <tonysong...@gmail.com> wrote:
> 
> Hi Ken,
> 
> There is a discussion in issue
> <https://issues.apache.org/jira/browse/FLINK-12122> about a feature related
> to your demand. It proposes spread tasks evenly across TMs. However, the
> feature is still in progress, and it spreads all tasks evenly instead of
> specific operators.
> 
> For the time being, I would suggest to have only one slot per TM, and use slot
> sharing group
> <https://ci.apache.org/projects/flink/flink-docs-release-1.8/concepts/runtime.html#task-slots-and-resources>
> to make sure tasks of the same job graph vertex do not goes into the same
> slot/TM.
> 
> Thank you~
> 
> Xintong Song
> 
> 
> On Thu, Jun 13, 2019 at 4:58 AM Ken Krugler <kkrugler_li...@transpac.com>
> wrote:
> 
>> Hi all,
>> 
>> I’m running a complex (batch) workflow that has a step where it trains
>> Fasttext models.
>> 
>> This is very CPU-intensive, to the point where it will use all available
>> processing power on a server.
>> 
>> The Flink configuration I’m using is one TaskManager per server, with N
>> slots == available cores.
>> 
>> So what I’d like to do is ensure that if I have N of these training
>> operators running in parallel on N TaskManagers, slot assignment happens
>> such that each TM has one such operator.
>> 
>> Unfortunately, what typically happens now is that most/all of these
>> operators get assigned to the same TM, which then struggles to stay alive
>> under that load.
>> 
>> I haven’t seen any solution to this, though I can imagine some helicopter
>> stunts that could work around the issue.
>> 
>> Any suggestions?
>> 
>> Thanks,
>> 
>> — Ken
>> 
>> PS - I took a look through the list of FLIPs
>> <https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals>,
>> and didn’t see anything that covered this. I imagine it would need to be
>> something like YARN’s support for per-node vCore capacity and per-task
>> vCore requirements, but on a per-TM/per-operator basis.

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra
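
For reference, the slot sharing group call Xintong mentioned looks roughly like this on the streaming side - a minimal sketch against the 1.8 DataStream API, where the class name, the sample elements, and the map function body are made-up placeholders standing in for the CPU-heavy training step:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SlotSharingSketch {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.fromElements("model-1", "model-2", "model-3")
                // Placeholder for the CPU-heavy training operator.
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String modelName) {
                        return modelName + "-trained";
                    }
                })
                // Put the operator in its own slot sharing group, so its subtasks
                // don't share slots with operators left in the "default" group.
                .slotSharingGroup("training")
                .name("fasttext-training")
                .print();

            env.execute("slot sharing sketch");
        }
    }

As discussed above, the batch (DataSet) side in 1.8 doesn't appear to expose an equivalent setter. The one-slot-per-TM setup Xintong suggests is just the standard flink-conf.yaml setting, combined with starting one TaskManager process per desired slot on each server:

    taskmanager.numberOfTaskSlots: 1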