Thanks for the nice suggestion, Till. The 'Bulk Slot Allocation' section has been updated.

Thanks,
Zhu Zhu
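A minimal sketch of what the bulk slot allocation mentioned above could look like. All names and signatures below are hypothetical illustrations, not the FLIP's actual interface; the idea is only that the slot requests of one pipelined region are submitted as a single bulk that is either fulfilled as a whole or failed as a whole once the timeout expires, so that no partially allocated region holds on to slots and deadlocks other regions.

import java.util.Collection;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch: allocate all slots of one pipelined region atomically. */
interface BulkSlotProvider {

    /**
     * Requests all slots of a pipelined region together. The returned future
     * completes only when every request in the bulk can be fulfilled; if the bulk
     * cannot be fulfilled within the timeout, all requests in it are failed together.
     */
    CompletableFuture<List<SlotAssignment>> allocateSlotsFor(
            Collection<SlotRequest> bulk, java.time.Duration timeout);
}

/** Placeholder types, only to make the sketch self-contained. */
final class SlotRequest {
    final String requestId;
    final boolean willOccupySlotIndefinitely; // true for unbounded (streaming) tasks

    SlotRequest(String requestId, boolean willOccupySlotIndefinitely) {
        this.requestId = requestId;
        this.willOccupySlotIndefinitely = willOccupySlotIndefinitely;
    }
}

final class SlotAssignment {
    final SlotRequest request;
    final String slotId;

    SlotAssignment(SlotRequest request, String slotId) {
        this.request = request;
        this.slotId = slotId;
    }
}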
On Mon, Mar 30, 2020 at 3:38 PM, Gary Yao <g...@apache.org> wrote:

> The links work for me now. Someone might have fixed them. Never mind.

Actually, I fixed the links after seeing your email. Thanks for reporting.

Best,
Gary

On Mon, Mar 30, 2020 at 3:48 AM, Xintong Song <tonysong...@gmail.com> wrote:

@ZhuZhu

The links work for me now. Someone might have fixed them. Never mind.

Thank you~

Xintong Song

On Mon, Mar 30, 2020 at 1:31 AM, Zhu Zhu <reed...@gmail.com> wrote:

Thanks for the comments!

To Xintong,
It's a bit strange, since the in-page links work as expected for me. Would you take another try?

To Till,
- Regarding the idea to improve the SlotProvider interface:
I think it is a good idea, thanks a lot! In the current design we make slot requests for batch jobs wait for resources without a timeout as long as the JM sees enough slots overall. This implicitly assumes that tasks can finish and slots will be returned. This, however, would not work for the mixed bounded/unbounded workloads you mentioned.
Your idea is cleaner in that it always allows slot allocations to wait and not time out as long as enough slots are visible. And the 'enough' check is done with regard to slots that can be returned (for bounded tasks) and slots that will be occupied forever (for unbounded tasks), so that streaming jobs naturally throw slot allocation timeout errors if the cluster does not have enough resources for all the tasks to run at the same time.
I will take a deeper look at how we can implement it this way.

- Regarding the idea to solve "Resource deadlocks when slot allocation competition happens between multiple jobs in a session cluster":
Agreed, it is also possible to let the RM revoke slots to unblock the oldest bulk of requests first. That would require some extra work to change the RM to hold the requests until it is sure the slots are successfully assigned to the JM (currently the RM removes pending requests right after the requests are sent to the TM, without confirming whether the slot offers succeed). We can look deeper into it later when we are about to support slots of varying sizes.

Thanks,
Zhu Zhu

On Fri, Mar 27, 2020 at 10:59 PM, Till Rohrmann <trohrm...@apache.org> wrote:

Thanks for creating this FLIP, Zhu Zhu and Gary!

+1 for adding pipelined region scheduling.

Concerning the extended SlotProvider interface, I have an idea how we could further improve it. If I am not mistaken, you have proposed to introduce the two timeouts in order to distinguish between batch and streaming jobs and to encode that batch job requests can wait if there are enough resources in the SlotPool (not necessarily available right now). I think what we actually need to tell the SlotProvider is whether a request will use the slot only for a limited time or not. This is exactly the difference between processing bounded and unbounded streams. If the SlotProvider knows this difference, then it can tell which slots will eventually be reusable and which will not. Based on this it can tell whether a slot request can be fulfilled eventually or whether we should fail after the specified timeout. Another benefit of this approach is that we can easily support mixed bounded/unbounded workloads. What we would need to know for this approach is whether a pipelined region is processing a bounded or an unbounded stream.

To give an example, let's assume we request the following sets of slots, where each pipelined region requires the same slots:

slotProvider.allocateSlots(pr1_bounded, timeout);
slotProvider.allocateSlots(pr2_unbounded, timeout);
slotProvider.allocateSlots(pr3_bounded, timeout);

Let's assume we receive slots for pr1_bounded in < timeout and can then fulfill the request. Then we request pr2_unbounded. Since we know that pr1_bounded will complete eventually, we don't fail this request after the timeout. Next we request pr3_bounded after pr2_unbounded has been fulfilled. In this case, we see that we need to request new resources because pr2_unbounded won't release its slots. Hence, if we cannot allocate new resources within the timeout, we fail this request.

A small comment concerning "Resource deadlocks when slot allocation competition happens between multiple jobs in a session cluster": another idea to solve this situation would be to give the ResourceManager the right to revoke slot assignments in order to change the mapping between requests and available slots.

Cheers,
Till
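To make the bounded/unbounded distinction from Till's mail concrete, here is a rough sketch of the fulfillability check it implies, with hypothetical names (this is not the proposed implementation): a pending bulk may keep waiting only as long as the slots it needs are covered by slots that are free or held by bounded tasks, i.e. slots that will eventually be returned.

import java.util.Collection;

/** Hypothetical view of a slot known to the slot pool. */
final class SlotInfo {
    final boolean free;                 // currently unoccupied
    final boolean occupiedIndefinitely; // held by an unbounded (streaming) task

    SlotInfo(boolean free, boolean occupiedIndefinitely) {
        this.free = free;
        this.occupiedIndefinitely = occupiedIndefinitely;
    }
}

final class BulkFulfillabilityChecker {

    /**
     * A pending bulk of requests is allowed to keep waiting (instead of timing out)
     * only if the number of slots that are free, or will eventually be freed by
     * bounded tasks, covers the size of the bulk. Slots held by unbounded tasks are
     * never counted, so a streaming job that overcommits the cluster still fails
     * with a slot allocation timeout.
     */
    static boolean canFulfillEventually(int requestedSlots, Collection<SlotInfo> allSlots) {
        long eventuallyAvailable = allSlots.stream()
                .filter(slot -> slot.free || !slot.occupiedIndefinitely)
                .count();
        return eventuallyAvailable >= requestedSlots;
    }
}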
On Fri, Mar 27, 2020 at 12:44 PM, Xintong Song <tonysong...@gmail.com> wrote:

Gary & Zhu Zhu,

Thanks for preparing this FLIP, and a BIG +1 from my side. The trade-off between resource utilization and potential deadlock problems has always been a pain. Despite not solving all the deadlock cases, this FLIP is definitely a big improvement. IIUC, it already covers all the existing single-job cases, and all the mentioned non-covered cases are either in multi-job session clusters or involve diverse slot resources in the future.

I've read through the FLIP, and it looks really good to me. Good job! All the concerns and limitations that I can think of have already been clearly stated, with reasonable potential future solutions. From the perspective of fine-grained resource management, I do not see any serious/irresolvable conflict at this time.

nit: The in-page links are not working. I guess those were copied from Google Docs directly?

Thank you~

Xintong Song

On Fri, Mar 27, 2020 at 6:26 PM, Zhu Zhu <reed...@gmail.com> wrote:

To Yangze,

>> the blocking edge will not be consumable before the upstream is finished.
Yes. This is how we define a BLOCKING result partition: "Blocking partitions represent blocking data exchanges, where the data stream is first fully produced and then consumed".

>> I'm also wondering could we execute the upstream and downstream regions at the same time if we have enough resources
It may lead to resource waste, since the tasks in downstream regions cannot read any data before the upstream region finishes. It saves a bit of time on scheduling, but usually that does not make much difference for large jobs, since data processing takes much more time. For small jobs, one can make all edges PIPELINED so that all the tasks can be scheduled at the same time.

>> is it possible to change the data exchange mode of two regions dynamically?
This is not in the scope of this FLIP. But we are moving towards a more extensible scheduler (FLINK-10429) and resource-aware scheduling (FLINK-10407), so I think it is possible that we can have a scheduler in the future which dynamically changes the shuffle type based on available resources.

Thanks,
Zhu Zhu
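As a side note to the "make all edges PIPELINED" suggestion above: in the DataSet API the data exchange mode can be influenced via the execution mode. A small sketch, assuming the DataSet API available around Flink 1.10; for a simple chained map there is no network exchange, the snippet only shows where the mode is set.

import org.apache.flink.api.common.ExecutionMode;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;

public class PipelinedExchangeExample {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Prefer pipelined data exchanges so that upstream and downstream tasks
        // can run at the same time (at the cost of needing slots for both regions).
        env.getConfig().setExecutionMode(ExecutionMode.PIPELINED);

        env.fromElements(1, 2, 3, 4)
                .map(new MapFunction<Integer, Integer>() {
                    @Override
                    public Integer map(Integer value) {
                        return value * value;
                    }
                })
                .print();
    }
}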
On Fri, Mar 27, 2020 at 4:49 PM, Yangze Guo <karma...@gmail.com> wrote:

Thanks for updating!

+1 for supporting pipelined region scheduling. Although we cannot prevent resource deadlocks in all scenarios, it is really a big step.

The design generally LGTM.

One minor thing I want to make sure of: if I understand correctly, the blocking edge will not be consumable before the upstream is finished. Without that, when a failure occurs in the upstream region, it is still possible to have a resource deadlock. I don't know whether it is an explicit protocol now, but after this FLIP I think it should not be broken.
I'm also wondering whether we could execute the upstream and downstream regions at the same time if we have enough resources. It could shorten the running time of large jobs. We should not break the protocol of the blocking edge, but is it possible to change the data exchange mode of two regions dynamically?

Best,
Yangze Guo

On Fri, Mar 27, 2020 at 1:15 PM, Zhu Zhu <reed...@gmail.com> wrote:

Thanks for reporting this, Yangze.
I have updated the permissions on those images. Everyone is able to view them now.

Thanks,
Zhu Zhu

On Fri, Mar 27, 2020 at 11:25 AM, Yangze Guo <karma...@gmail.com> wrote:

Thanks for driving this discussion, Zhu Zhu & Gary.

I found that the image links in this FLIP are not working well. When I open a link, Google Docs tells me that I have no access privilege. Could you take a look at that issue?
Best,
Yangze Guo

On Fri, Mar 27, 2020 at 1:38 AM, Gary Yao <g...@apache.org> wrote:

Hi community,

In the past releases, we have been working on refactoring Flink's scheduler with the goal of making the scheduler extensible [1]. We have rolled out most of the intended refactoring in Flink 1.10, and we think it is now time to leverage our newly introduced abstractions to implement a new resource-optimized scheduling strategy: Pipelined Region Scheduling.

This scheduling strategy aims at:

* avoidance of resource deadlocks when running batch jobs
* tunability with respect to resource consumption and throughput

More details can be found in the Wiki [2]. We are looking forward to your feedback.

Best,

Zhu Zhu & Gary

[1] https://issues.apache.org/jira/browse/FLINK-10429

[2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-119+Pipelined+Region+Scheduling