Re: Use of slot sharing groups causing workflow to hang

Xintong Song Wed, 09 Sep 2020 22:36:24 -0700

Hi Ken,

I've got a Flink MiniCluster with 12 slots. Even with only 6 pipelined
> operators, each with a parallelism of 1, it still hangs while starting.
>


Could you double check that the minicluster has 12 slots when each or your
operators has only 1 parallelism?

I've looked into the codes. Currently, without any explicit configurations,
minicluster will by default create 1 taskmanager, and the number of slots
on that taskmanager is decided by the max parallelism of your job
operators. That means the number of slots in the minicluster changes
dynamically when the operators' parallelisms change.

By default, all operators are in the same slot sharing groups, thus the
number of slots needed for executing the job is the max parallelism of the
operators. When you separate the operators into more slot sharing groups,
the number of slots needed for job execution increases and you would need
to manually adjust the configurations to provide enough slots. Related
configuration options are `local.number-taskmanager` and
`taskmanager.numberOfTaskSlots`.

Also, please notice that, for local executions (from IDE) the entire flink
application runs in the same process, thus separating the pipeline into
several slot sharing groups will not bring any benefit. If you are just
trying out with the slot sharing groups or preparing for later deploying
the execution to a distributed cluster, then there should be no problem.

Thank you~

Xintong Song



On Thu, Sep 10, 2020 at 11:22 AM Yangze Guo <karma...@gmail.com> wrote:

> Hi, Ken
>
> From the RM perspective, could you share the following logs:
> - "Request slot with profile {} for job {} with allocation id {}.".
> - "Requesting new slot [{}] and profile {} with allocation id {} from
> resource manager."
> This will help to figure out how many slots your job indeed requests.
> And probably help to figure out what the ExecutionGraph finally looks
> like.
>
>
> Best,
> Yangze Guo
>
> On Thu, Sep 10, 2020 at 10:47 AM Ken Krugler
> <kkrugler_li...@transpac.com> wrote:
> >
> > Hi Til,
> >
> > On Sep 3, 2020, at 12:31 AM, Till Rohrmann <trohrm...@apache.org> wrote:
> >
> > Hi Ken,
> >
> > I believe that we don't have a lot if not any explicit logging about the
> slot sharing group in the code. You can, however, learn indirectly about it
> by looking at the required number of AllocatedSlots in the SlotPool. Also
> the number of "multi task slot" which are created should vary because every
> group of slot sharing tasks will create one of them. For learning about the
> SlotPoolImpl's status, you can also take a look at SlotPoolImpl.printStatus.
> >
> > For the underlying problem, I believe that Yangze could be right. How
> many resources do you have in your cluster?
> >
> >
> > I've got a Flink MiniCluster with 12 slots. Even with only 6 pipelined
> > operators, each with a parallelism of 1, it still hangs while starting.
> So
> > I don't think that it's a resource issue.
> >
> > One odd thing I've noticed. I've got three streams that I union together.
> > Two of the streams are in separate slot sharing groups, the third is not
> > assigned to a group. But when I check the logs, I see three "Create multi
> > task slot" entries. I'm wondering if unioning streams that are in
> different
> > slot sharing groups creates a problem.
> >
> > Thanks,
> >
> > -- Ken
> >
> > On Thu, Sep 3, 2020 at 4:25 AM Yangze Guo <karma...@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> The failure of requesting slots usually because of the lack of
> >> resources. If you put part of the workflow to a specific slot sharing
> >> group, it may require more slots to run the workflow than before.
> >> Could you share logs of the ResourceManager and SlotManager, I think
> >> there are more clues in it.
> >>
> >> Best,
> >> Yangze Guo
> >>
> >> On Thu, Sep 3, 2020 at 4:39 AM Ken Krugler <kkrugler_li...@transpac.com>
> wrote:
> >> >
> >> > Hi all,
> >> >
> >> > I’ve got a streaming workflow (using Flink 1.11.1) that runs fine
> locally (via Eclipse), with a parallelism of either 3 or 6.
> >> >
> >> > If I set up part of the workflow to use a specific (not “default”)
> slot sharing group with a parallelism of 3, and the remaining portions of
> the workflow have a parallelism of either 1 or 2, then the workflow never
> starts running, and eventually fails due to a slot request not being
> fulfilled in time.
> >> >
> >> > So I’m wondering how best to debug this.
> >> >
> >> > I don’t see any information (even at DEBUG level) being logged about
> which operators are in what slot sharing group, or which slots are assigned
> to what groups.
> >> >
> >> > Thanks,
> >> >
> >> > — Ken
> >> >
> >> > PS - I’ve looked at https://issues.apache.org/jira/browse/FLINK-8712,
> and tried the approach of setting # of slots in the config, but that didn’t
> change anything. I see that issue is still open, so wondering what Til and
> Konstantin have to say about it.
> >> >
> >> > --------------------------
> >> > Ken Krugler
> >> > http://www.scaleunlimited.com
> >> > custom big data solutions & training
> >> > Hadoop, Cascading, Cassandra & Solr
> >> >
> >
> >
> > --------------------------
> > Ken Krugler
> > http://www.scaleunlimited.com
> > custom big data solutions & training
> > Hadoop, Cascading, Cassandra & Solr
> >
>

Re: Use of slot sharing groups causing workflow to hang

Reply via email to