Re: Use of slot sharing groups causing workflow to hang

Yangze Guo Wed, 09 Sep 2020 20:23:09 -0700

Hi, Ken

From the RM perspective, could you share the following logs:
- "Request slot with profile {} for job {} with allocation id {}.".
- "Requesting new slot [{}] and profile {} with allocation id {} from
resource manager."
This will help to figure out how many slots your job actually requests,
and probably also what the ExecutionGraph finally looks like.
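
For comparison with those log lines, the expected slot count can be
estimated by hand, assuming the usual rule that each slot sharing group
needs as many slots as its widest operator, and the job needs the sum
over all groups. A minimal sketch in plain Java (no Flink dependency; the
Op class, the group names and the parallelism values are hypothetical and
not taken from the actual job):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SlotEstimate {

    /** Hypothetical description of one operator: its slot sharing group and parallelism. */
    public static final class Op {
        final String group;
        final int parallelism;

        public Op(String group, int parallelism) {
            this.group = group;
            this.parallelism = parallelism;
        }
    }

    /** Sums, over all slot sharing groups, the maximum parallelism found in each group. */
    public static int requiredSlots(List<Op> ops) {
        Map<String, Integer> maxPerGroup = new LinkedHashMap<>();
        for (Op op : ops) {
            // Keep the largest parallelism seen so far for this group.
            maxPerGroup.merge(op.group, op.parallelism, Math::max);
        }
        int total = 0;
        for (int max : maxPerGroup.values()) {
            total += max;
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed layout: one named group at parallelism 3, the rest in "default" at 1 or 2.
        List<Op> ops = List.of(
                new Op("default", 1),
                new Op("default", 2),
                new Op("groupA", 3));
        System.out.println(requiredSlots(ops)); // 2 (default) + 3 (groupA) = 5
    }
}
```

If the number of slot requests in the logs is larger than this estimate,
the operators are probably not grouped the way you expect.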



Best,
Yangze Guo

On Thu, Sep 10, 2020 at 10:47 AM Ken Krugler
<kkrugler_li...@transpac.com> wrote:
>
> Hi Till,
>
> On Sep 3, 2020, at 12:31 AM, Till Rohrmann <trohrm...@apache.org> wrote:
>
> Hi Ken,
>
> I believe that we have little, if any, explicit logging about the slot
> sharing group in the code. You can, however, learn about it indirectly by
> looking at the required number of AllocatedSlots in the SlotPool. Also, the
> number of "multi task slots" that are created should vary, because every
> group of slot sharing tasks will create one of them. To learn about the
> SlotPoolImpl's status, you can also take a look at SlotPoolImpl.printStatus.
>
> For the underlying problem, I believe that Yangze could be right. How many 
> resources do you have in your cluster?
>
>
> I've got a Flink MiniCluster with 12 slots. Even with only 6 pipelined
> operators, each with a parallelism of 1, it still hangs while starting. So
> I don't think that it's a resource issue.
>
> One odd thing I've noticed. I've got three streams that I union together.
> Two of the streams are in separate slot sharing groups, the third is not
> assigned to a group. But when I check the logs, I see three "Create multi
> task slot" entries. I'm wondering if unioning streams that are in different
> slot sharing groups creates a problem.
>
> Thanks,
>
> -- Ken
>
> On Thu, Sep 3, 2020 at 4:25 AM Yangze Guo <karma...@gmail.com> wrote:
>>
>> Hi,
>>
>> A failure to request slots is usually caused by a lack of resources.
>> If you put part of the workflow into a specific slot sharing group,
>> it may require more slots to run the workflow than before.
>> Could you share the logs of the ResourceManager and SlotManager? I
>> think there are more clues in them.
>>
>> Best,
>> Yangze Guo
>>
>> On Thu, Sep 3, 2020 at 4:39 AM Ken Krugler <kkrugler_li...@transpac.com> 
>> wrote:
>> >
>> > Hi all,
>> >
>> > I’ve got a streaming workflow (using Flink 1.11.1) that runs fine locally 
>> > (via Eclipse), with a parallelism of either 3 or 6.
>> >
>> > If I set up part of the workflow to use a specific (not “default”) slot 
>> > sharing group with a parallelism of 3, and the remaining portions of the 
>> > workflow have a parallelism of either 1 or 2, then the workflow never 
>> > starts running, and eventually fails due to a slot request not being 
>> > fulfilled in time.
>> >
>> > So I’m wondering how best to debug this.
>> >
>> > I don’t see any information (even at DEBUG level) being logged about which 
>> > operators are in what slot sharing group, or which slots are assigned to 
>> > what groups.
>> >
>> > Thanks,
>> >
>> > — Ken
>> >
>> > PS - I’ve looked at https://issues.apache.org/jira/browse/FLINK-8712, and
>> > tried the approach of setting the number of slots in the config, but that
>> > didn’t change anything. I see that issue is still open, so I’m wondering
>> > what Till and Konstantin have to say about it.
>> >
>> > --------------------------
>> > Ken Krugler
>> > http://www.scaleunlimited.com
>> > custom big data solutions & training
>> > Hadoop, Cascading, Cassandra & Solr
>> >
>
>
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
