Re: Use of slot sharing groups causing workflow to hang

Ken Krugler Wed, 09 Sep 2020 19:48:09 -0700

Hi Till,

> On Sep 3, 2020, at 12:31 AM, Till Rohrmann <trohrm...@apache.org> wrote:
> 
> Hi Ken,
> 
> I believe that we have little, if any, explicit logging about the slot 
> sharing group in the code. You can, however, learn about it indirectly by 
> looking at the required number of AllocatedSlots in the SlotPool. Also, the 
> number of "multi task slots" that are created should vary, because every 
> group of slot sharing tasks will create one of them. To learn about the 
> SlotPoolImpl's status, you can also take a look at SlotPoolImpl.printStatus.
> 
> For the underlying problem, I believe that Yangze could be right. How many 
> resources do you have in your cluster?


I've got a Flink MiniCluster with 12 slots. Even with only 6 pipelined
operators, each with a parallelism of 1, it still hangs while starting. So
I don't think that it's a resource issue.
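
As a rough sanity check on the slot math, here is a small sketch. It assumes the usual rule of thumb that a pipelined job needs, for each slot sharing group, as many slots as the highest parallelism in that group. The class, method, and group names are made up for illustration and are not Flink APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SlotEstimate {

    /**
     * Estimated slots = sum over slot sharing groups of the highest
     * operator parallelism within each group (assumed rule of thumb).
     * groups[i] is the group name of operator i, parallelism[i] its parallelism.
     */
    static int requiredSlots(String[] groups, int[] parallelism) {
        Map<String, Integer> maxPerGroup = new LinkedHashMap<>();
        for (int i = 0; i < groups.length; i++) {
            // Keep the largest parallelism seen so far for this group.
            maxPerGroup.merge(groups[i], parallelism[i], Integer::max);
        }
        int total = 0;
        for (int p : maxPerGroup.values()) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        // Worst case for the reduced test: six operators with parallelism 1,
        // each placed in its own (hypothetical) group.
        String[] sixGroups = {"g1", "g2", "g3", "g4", "g5", "g6"};
        int[] sixOnes = {1, 1, 1, 1, 1, 1};
        System.out.println(requiredSlots(sixGroups, sixOnes)); // prints 6

        // Original setup: one custom group at parallelism 3, the rest of the
        // workflow in "default" at parallelism 1 or 2.
        String[] mixedGroups = {"custom", "default", "default"};
        int[] mixedParallelism = {3, 1, 2};
        System.out.println(requiredSlots(mixedGroups, mixedParallelism)); // prints 5
    }
}
```

Under that assumption, both cases come in well below the 12 slots the MiniCluster has.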

One odd thing I've noticed: I've got three streams that I union together.
Two of the streams are in separate slot sharing groups, the third is not
assigned to a group. But when I check the logs, I see three "Create multi
task slot" entries. I'm wondering if unioning streams that are in different
slot sharing groups creates a problem.
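
To make the setup concrete, here is a small model of how I understand group assignment to work. It is a sketch only, based on the documented rule that an operator without an explicit group inherits its inputs' group when they all agree and otherwise lands in "default". The names "groupA" and "groupB" and the helper below are hypothetical, and this is not Flink's actual code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SlotGroupResolution {

    static final String DEFAULT_GROUP = "default";

    /**
     * Resolve the slot sharing group of an operator: an explicit group wins;
     * otherwise inherit from the inputs if they all share one group,
     * else fall back to "default".
     */
    static String resolveGroup(String explicitGroup, List<String> inputGroups) {
        if (explicitGroup != null) {
            return explicitGroup;
        }
        String inherited = null;
        for (String group : inputGroups) {
            if (inherited == null) {
                inherited = group;
            } else if (!inherited.equals(group)) {
                return DEFAULT_GROUP; // inputs disagree
            }
        }
        return inherited == null ? DEFAULT_GROUP : inherited;
    }

    public static void main(String[] args) {
        List<String> noInputs = Collections.emptyList();
        String streamA = resolveGroup("groupA", noInputs); // explicit group
        String streamB = resolveGroup("groupB", noInputs); // explicit group
        String streamC = resolveGroup(null, noInputs);     // unassigned source

        // Operator consuming the union of the three streams, no explicit group.
        String afterUnion = resolveGroup(null, Arrays.asList(streamA, streamB, streamC));

        Set<String> distinct = new LinkedHashSet<>(
                Arrays.asList(streamA, streamB, streamC, afterUnion));
        System.out.println(afterUnion);      // prints default
        System.out.println(distinct.size()); // prints 3
    }
}
```

If that model is right, three distinct groups would be consistent with the three "Create multi task slot" entries, though it doesn't explain the hang by itself.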

Thanks,

-- Ken

> On Thu, Sep 3, 2020 at 4:25 AM Yangze Guo <karma...@gmail.com> wrote:
> Hi,
> 
> A failed slot request is usually caused by a lack of resources. If you
> put part of the workflow into a specific slot sharing group, it may
> require more slots to run the workflow than before. Could you share the
> logs of the ResourceManager and SlotManager? I think there are more
> clues in them.
> 
> Best,
> Yangze Guo
> 
> On Thu, Sep 3, 2020 at 4:39 AM Ken Krugler <kkrugler_li...@transpac.com> wrote:
> >
> > Hi all,
> >
> > I’ve got a streaming workflow (using Flink 1.11.1) that runs fine locally 
> > (via Eclipse), with a parallelism of either 3 or 6.
> >
> > If I set up part of the workflow to use a specific (not “default”) slot 
> > sharing group with a parallelism of 3, and the remaining portions of the 
> > workflow have a parallelism of either 1 or 2, then the workflow never 
> > starts running, and eventually fails due to a slot request not being 
> > fulfilled in time.
> >
> > So I’m wondering how best to debug this.
> >
> > I don’t see any information (even at DEBUG level) being logged about which 
> > operators are in what slot sharing group, or which slots are assigned to 
> > what groups.
> >
> > Thanks,
> >
> > — Ken
> >
> > PS - I’ve looked at https://issues.apache.org/jira/browse/FLINK-8712 and 
> > tried the approach of setting the number of slots in the config, but that 
> > didn’t change anything. I see that the issue is still open, so I’m 
> > wondering what Till and Konstantin have to say about it.
> >
> > --------------------------
> > Ken Krugler
> > http://www.scaleunlimited.com
> > custom big data solutions & training
> > Hadoop, Cascading, Cassandra & Solr
> >

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
