Hi, Zhu Zhu,

Thanks for the explanation and information. I also agree that it is out
of the scope of this FLIP, and I'm glad to see that we are moving forward
to a more extensible scheduler!

Best,
Yangze Guo

On Mon, Mar 30, 2020 at 1:31 AM Zhu Zhu <reed...@gmail.com> wrote:
>
> Thanks for the comments!
>
> To Xintong,
> It's a bit strange since the in-page links work as expected for me. Would
> you give it another try?
>
> To Till,
> - Regarding the idea to improve the SlotProvider interface
> I think it is a good idea, thanks a lot! In the current design we let slot
> requests for batch jobs wait for resources without a timeout as long as the
> JM sees enough slots overall. This implicitly assumes that tasks can finish
> and their slots will be returned. This, however, would not work for mixed
> bounded/unbounded workloads, as you mentioned.
> Your idea is clearer: it always allows slot allocations to wait and not time
> out as long as enough slots are seen, and the 'enough' check distinguishes
> slots that can be returned (held by bounded tasks) from slots that will be
> occupied forever (held by unbounded tasks). This way, streaming jobs can
> naturally throw slot allocation timeout errors if the cluster does not have
> enough resources for all their tasks to run at the same time.
> I will think more deeply about how we can implement it this way.
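> As a rough illustration of that 'enough' check, here is a minimal sketch
> (the names below are hypothetical, not the actual SlotPool API):
>
> // Hypothetical sketch: a bulk of slot requests may keep waiting without
> // timing out only if the slots it needs are either free already or can
> // eventually be freed by currently running bounded (batch) tasks.
> boolean canFulfillEventually(int requiredSlots, SlotPoolStatus pool) {
>     return requiredSlots
>         <= pool.freeSlots() + pool.slotsOccupiedByBoundedTasks();
> }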
>
> - Regarding the idea to solve "Resource deadlocks when slot allocation
> competition happens between multiple jobs in a session cluster"
> Agreed, it's also possible to let the RM revoke slots to unblock the oldest
> bulk of requests first. That would require some extra work to make the RM
> hold the requests until it is sure the slots have been successfully assigned
> to the JM (currently the RM removes pending requests right after they are
> sent to the TM, without confirming whether the slot offers succeed). We can
> look deeper into it later when we are about to support slots of varying
> sizes.
>
> Thanks,
> Zhu Zhu
>
>
> On Fri, Mar 27, 2020 at 10:59 PM Till Rohrmann <trohrm...@apache.org> wrote:
>
> > Thanks for creating this FLIP Zhu Zhu and Gary!
> >
> > +1 for adding pipelined region scheduling.
> >
> > Concerning the extended SlotProvider interface, I have an idea of how we
> > could further improve it. If I am not mistaken, then you have proposed to
> > introduce the two timeouts in order to distinguish between batch and
> > streaming jobs and to encode that batch job requests can wait if there are
> > enough resources in the SlotPool (not necessarily being available right
> > now). I think what we actually need to tell the SlotProvider is whether a
> > request will use the slot only for a limited time or not. This is exactly
> > the difference between processing bounded and unbounded streams. If the
> > SlotProvider knows this difference, then it can tell which slots will
> > eventually be reusable and which not. Based on this it can tell whether a
> > slot request can be fulfilled eventually or whether we fail after the
> > specified timeout. Another benefit of this approach would be that we can
> > easily support mixed bounded/unbounded workloads. What we would need to
> > know for this approach is whether a pipelined region is processing a
> > bounded or unbounded stream.
> >
> > To give an example let's assume we request the following sets of slots
> > where each pipelined region requires the same slots:
> >
> > slotProvider.allocateSlots(pr1_bounded, timeout);
> > slotProvider.allocateSlots(pr2_unbounded, timeout);
> > slotProvider.allocateSlots(pr3_bounded, timeout);
> >
> > Let's assume we receive slots for pr1_bounded in < timeout and can then
> > fulfill the request. Then we request pr2_unbounded. Since we know that
> > pr1_bounded will complete eventually, we don't fail this request after
> > timeout. Next we request pr3_bounded after the pr2_unbounded request has
> > been fulfilled. In this case, we see that we need to request new resources
> > because pr2_unbounded won't release its slots. Hence, if we cannot allocate
> > new resources within the timeout, we fail this request.
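> >
> > A minimal sketch of how the extended SlotProvider signature could carry
> > this information (the names are only illustrative, not a finalized
> > interface):
> >
> > CompletableFuture<Collection<LogicalSlot>> allocateSlots(
> >     Collection<SlotRequest> pipelinedRegionRequests,
> >     boolean boundedInput, // slots used for a limited time vs. forever
> >     Time timeout);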
> >
> > A small comment concerning "Resource deadlocks when slot allocation
> > competition happens between multiple jobs in a session cluster": Another
> > idea to solve this situation would be to give the ResourceManager the right
> > to revoke slot assignments in order to change the mapping between requests
> > and available slots.
> >
> > Cheers,
> > Till
> >
> > On Fri, Mar 27, 2020 at 12:44 PM Xintong Song <tonysong...@gmail.com>
> > wrote:
> >
> > > Gary & Zhu Zhu,
> > >
> > > Thanks for preparing this FLIP, and a BIG +1 from my side. The trade-off
> > > between resource utilization and potential deadlock problems has always
> > > been a pain. Despite not solving all the deadlock cases, this FLIP is
> > > definitely a big improvement. IIUC, it has already covered all the
> > > existing single-job cases, and all the mentioned non-covered cases are
> > > either in multi-job session clusters or involve diverse slot resources in
> > > the future.
> > >
> > > I've read through the FLIP, and it looks really good to me. Good job! All
> > > the concerns and limitations that I can think of have already been clearly
> > > stated, with reasonable potential future solutions. From the perspective of
> > > fine-grained resource management, I do not see any serious/irresolvable
> > > conflict at this time.
> > >
> > > nit: The in-page links are not working. I guess those are copied from
> > > Google Docs directly?
> > >
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Fri, Mar 27, 2020 at 6:26 PM Zhu Zhu <reed...@gmail.com> wrote:
> > >
> > > > To Yangze,
> > > >
> > > > >> the blocking edge will not be consumable before the upstream is
> > > > >> finished.
> > > > Yes. This is how we define a BLOCKING result partition: "Blocking
> > > > partitions represent blocking data exchanges, where the data stream is
> > > > first fully produced and then consumed".
> > > >
> > > > >> I'm also wondering could we execute the upstream and downstream
> > > > >> regions at the same time if we have enough resources
> > > > It may lead to resource waste since the tasks in downstream regions
> > > > cannot read any data before the upstream region finishes. It saves a
> > > > bit of time on scheduling, but usually that does not make much
> > > > difference for large jobs, since data processing takes much more time.
> > > > For small jobs, one can make all edges PIPELINED so that all the tasks
> > > > can be scheduled at the same time.
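> > > >
> > > > As a rough sketch, with the DataSet API one way to request pipelined
> > > > data exchanges on every edge is (assuming PIPELINED_FORCED is acceptable
> > > > for the job):
> > > >
> > > > ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
> > > > // Force pipelined data exchanges on all edges, including shuffles.
> > > > env.getConfig().setExecutionMode(ExecutionMode.PIPELINED_FORCED);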
> > > >
> > > > >> is it possible to change the data exchange mode of two regions
> > > > >> dynamically?
> > > > This is not in the scope of this FLIP. But we are moving forward to a
> > > > more extensible scheduler (FLINK-10429) and resource-aware scheduling
> > > > (FLINK-10407), so I think it's possible to have a scheduler in the
> > > > future which dynamically changes the shuffle type based on the
> > > > available resources.
> > > >
> > > > Thanks,
> > > > Zhu Zhu
> > > >
> > > > On Fri, Mar 27, 2020 at 4:49 PM Yangze Guo <karma...@gmail.com> wrote:
> > > >
> > > > > Thanks for updating!
> > > > >
> > > > > +1 for supporting pipelined region scheduling. Although it cannot
> > > > > prevent resource deadlocks in all scenarios, it is really a big step.
> > > > >
> > > > > The design generally LGTM.
> > > > >
> > > > > One minor thing I want to make sure of: if I understand correctly,
> > > > > the blocking edge will not be consumable before the upstream is
> > > > > finished. Without this, when a failure occurs in the upstream region,
> > > > > it is still possible to have a resource deadlock. I don't know whether
> > > > > it is an explicit protocol now, but after this FLIP I think it should
> > > > > not be broken.
> > > > > I'm also wondering whether we could execute the upstream and
> > > > > downstream regions at the same time if we have enough resources. It
> > > > > could shorten the running time of large jobs. We should not break the
> > > > > protocol of the blocking edge, but is it possible to change the data
> > > > > exchange mode of two regions dynamically?
> > > > >
> > > > > Best,
> > > > > Yangze Guo
> > > > >
> > > > > On Fri, Mar 27, 2020 at 1:15 PM Zhu Zhu <reed...@gmail.com> wrote:
> > > > > >
> > > > > > Thanks for reporting this, Yangze.
> > > > > > I have updated the permissions on those images. Everyone is able to
> > > > > > view them now.
> > > > > >
> > > > > > Thanks,
> > > > > > Zhu Zhu
> > > > > >
> > > > > > On Fri, Mar 27, 2020 at 11:25 AM Yangze Guo <karma...@gmail.com> wrote:
> > > > > >>
> > > > > >> Thanks for driving this discussion, Zhu Zhu & Gary.
> > > > > >>
> > > > > >> I found that the image link in this FLIP is not working well. When
> > > > > >> I open that link, Google Docs told me that I have no access
> > > > > >> privileges. Could you take a look at that issue?
> > > > > >>
> > > > > >> Best,
> > > > > >> Yangze Guo
> > > > > >>
> > > > > >> On Fri, Mar 27, 2020 at 1:38 AM Gary Yao <g...@apache.org> wrote:
> > > > > >> >
> > > > > >> > Hi community,
> > > > > >> >
> > > > > >> > In the past releases, we have been working on refactoring
> > > > > >> > Flink's scheduler with the goal of making the scheduler extensible
> > > > > >> > [1]. We have rolled out most of the intended refactoring in Flink
> > > > > >> > 1.10, and we think it is now time to leverage our newly introduced
> > > > > >> > abstractions to implement a new resource-optimized scheduling
> > > > > >> > strategy: Pipelined Region Scheduling.
> > > > > >> >
> > > > > >> > This scheduling strategy aims at:
> > > > > >> >
> > > > > >> >     * avoidance of resource deadlocks when running batch jobs
> > > > > >> >
> > > > > >> >     * tunable with respect to resource consumption and throughput
> > > > > >> >
> > > > > >> > More details can be found in the Wiki [2]. We are looking forward
> > > > > >> > to your feedback.
> > > > > >> >
> > > > > >> > Best,
> > > > > >> >
> > > > > >> > Zhu Zhu & Gary
> > > > > >> >
> > > > > >> > [1] https://issues.apache.org/jira/browse/FLINK-10429
> > > > > >> >
> > > > > >> > [2]
> > > > > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-119+Pipelined+Region+Scheduling
