Thanks for the nice suggestion, Till. The 'Bulk Slot Allocation' section has been updated.

Thanks,
Zhu Zhu
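A minimal sketch of what the bulk slot allocation mentioned above could look like. All names and signatures below are hypothetical illustrations, not the FLIP's actual interface; the idea is only that the slot requests of one pipelined region are submitted as a single bulk that is either fulfilled as a whole or failed as a whole once the timeout expires, so that no partially allocated region holds on to slots and deadlocks other regions.

import java.util.Collection;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch: allocate all slots of one pipelined region atomically. */
interface BulkSlotProvider {

    /**
     * Requests all slots of a pipelined region together. The returned future
     * completes only when every request in the bulk can be fulfilled; if the bulk
     * cannot be fulfilled within the timeout, all requests in it are failed together.
     */
    CompletableFuture<List<SlotAssignment>> allocateSlotsFor(
            Collection<SlotRequest> bulk, java.time.Duration timeout);
}

/** Placeholder types, only to make the sketch self-contained. */
final class SlotRequest {
    final String requestId;
    final boolean willOccupySlotIndefinitely; // true for unbounded (streaming) tasks

    SlotRequest(String requestId, boolean willOccupySlotIndefinitely) {
        this.requestId = requestId;
        this.willOccupySlotIndefinitely = willOccupySlotIndefinitely;
    }
}

final class SlotAssignment {
    final SlotRequest request;
    final String slotId;

    SlotAssignment(SlotRequest request, String slotId) {
        this.request = request;
        this.slotId = slotId;
    }
}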
On Mon, Mar 30, 2020 at 3:38 PM, Gary Yao <g...@apache.org> wrote:

> The links work for me now. Someone might have fixed them. Never mind.

Actually, I fixed the links after seeing your email. Thanks for reporting.

Best,
Gary

On Mon, Mar 30, 2020 at 3:48 AM, Xintong Song <tonysong...@gmail.com> wrote:

@ZhuZhu

The links work for me now. Someone might have fixed them. Never mind.

Thank you~

Xintong Song

On Mon, Mar 30, 2020 at 1:31 AM, Zhu Zhu <reed...@gmail.com> wrote:

Thanks for the comments!

To Xintong,
It's a bit strange, since the in-page links work as expected for me. Would you take another try?

To Till,
- Regarding the idea to improve the SlotProvider interface:
I think it is a good idea, thanks a lot! In the current design we make slot requests for batch jobs wait for resources without a timeout as long as the JM sees enough slots overall. This implicitly assumes that tasks can finish and slots will be returned. This, however, would not work for the mixed bounded/unbounded workloads you mentioned.
Your idea is cleaner in that it always allows slot allocations to wait and not time out as long as enough slots are visible. And the 'enough' check is done with regard to slots that can be returned (for bounded tasks) and slots that will be occupied forever (for unbounded tasks), so that streaming jobs naturally throw slot allocation timeout errors if the cluster does not have enough resources for all the tasks to run at the same time.
I will take a deeper look at how we can implement it this way.

- Regarding the idea to solve "Resource deadlocks when slot allocation competition happens between multiple jobs in a session cluster":
Agreed, it is also possible to let the RM revoke slots to unblock the oldest bulk of requests first. That would require some extra work to change the RM to hold the requests until it is sure the slots are successfully assigned to the JM (currently the RM removes pending requests right after the requests are sent to the TM, without confirming whether the slot offers succeed). We can look deeper into it later when we are about to support slots of varying sizes.

Thanks,
Zhu Zhu

On Fri, Mar 27, 2020 at 10:59 PM, Till Rohrmann <trohrm...@apache.org> wrote:

Thanks for creating this FLIP, Zhu Zhu and Gary!

+1 for adding pipelined region scheduling.

Concerning the extended SlotProvider interface, I have an idea how we could further improve it. If I am not mistaken, you have proposed to introduce the two timeouts in order to distinguish between batch and streaming jobs and to encode that batch job requests can wait if there are enough resources in the SlotPool (not necessarily available right now). I think what we actually need to tell the SlotProvider is whether a request will use the slot only for a limited time or not. This is exactly the difference between processing bounded and unbounded streams. If the SlotProvider knows this difference, then it can tell which slots will eventually be reusable and which will not. Based on this it can tell whether a slot request can be fulfilled eventually or whether we should fail after the specified timeout. Another benefit of this approach is that we can easily support mixed bounded/unbounded workloads. What we would need to know for this approach is whether a pipelined region is processing a bounded or an unbounded stream.

To give an example, let's assume we request the following sets of slots, where each pipelined region requires the same slots:

slotProvider.allocateSlots(pr1_bounded, timeout);
slotProvider.allocateSlots(pr2_unbounded, timeout);
slotProvider.allocateSlots(pr3_bounded, timeout);

Let's assume we receive slots for pr1_bounded in < timeout and can then fulfill the request. Then we request pr2_unbounded. Since we know that pr1_bounded will complete eventually, we don't fail this request after the timeout. Next we request pr3_bounded after pr2_unbounded has been fulfilled. In this case, we see that we need to request new resources because pr2_unbounded won't release its slots. Hence, if we cannot allocate new resources within the timeout, we fail this request.

A small comment concerning "Resource deadlocks when slot allocation competition happens between multiple jobs in a session cluster": another idea to solve this situation would be to give the ResourceManager the right to revoke slot assignments in order to change the mapping between requests and available slots.

Cheers,
Till
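To make the bounded/unbounded distinction from Till's mail concrete, here is a rough sketch of the fulfillability check it implies, with hypothetical names (this is not the proposed implementation): a pending bulk may keep waiting only as long as the slots it needs are covered by slots that are free or held by bounded tasks, i.e. slots that will eventually be returned.

import java.util.Collection;

/** Hypothetical view of a slot known to the slot pool. */
final class SlotInfo {
    final boolean free;                 // currently unoccupied
    final boolean occupiedIndefinitely; // held by an unbounded (streaming) task

    SlotInfo(boolean free, boolean occupiedIndefinitely) {
        this.free = free;
        this.occupiedIndefinitely = occupiedIndefinitely;
    }
}

final class BulkFulfillabilityChecker {

    /**
     * A pending bulk of requests is allowed to keep waiting (instead of timing out)
     * only if the number of slots that are free, or will eventually be freed by
     * bounded tasks, covers the size of the bulk. Slots held by unbounded tasks are
     * never counted, so a streaming job that overcommits the cluster still fails
     * with a slot allocation timeout.
     */
    static boolean canFulfillEventually(int requestedSlots, Collection<SlotInfo> allSlots) {
        long eventuallyAvailable = allSlots.stream()
                .filter(slot -> slot.free || !slot.occupiedIndefinitely)
                .count();
        return eventuallyAvailable >= requestedSlots;
    }
}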
On Fri, Mar 27, 2020 at 12:44 PM, Xintong Song <tonysong...@gmail.com> wrote:

Gary & Zhu Zhu,

Thanks for preparing this FLIP, and a BIG +1 from my side. The trade-off between resource utilization and potential deadlock problems has always been a pain. Despite not solving all the deadlock cases, this FLIP is definitely a big improvement. IIUC, it already covers all the existing single-job cases, and all the mentioned non-covered cases are either in multi-job session clusters or involve diverse slot resources in the future.

I've read through the FLIP, and it looks really good to me. Good job! All the concerns and limitations that I can think of have already been clearly stated, with reasonable potential future solutions. From the perspective of fine-grained resource management, I do not see any serious/irresolvable conflict at this time.

nit: The in-page links are not working. I guess those were copied from Google Docs directly?

Thank you~

Xintong Song

On Fri, Mar 27, 2020 at 6:26 PM, Zhu Zhu <reed...@gmail.com> wrote:

To Yangze,

>> the blocking edge will not be consumable before the upstream is finished.
Yes. This is how we define a BLOCKING result partition: "Blocking partitions represent blocking data exchanges, where the data stream is first fully produced and then consumed".

>> I'm also wondering could we execute the upstream and downstream regions at the same time if we have enough resources
It may lead to resource waste, since the tasks in downstream regions cannot read any data before the upstream region finishes. It saves a bit of time on scheduling, but usually that does not make much difference for large jobs, since data processing takes much more time. For small jobs, one can make all edges PIPELINED so that all the tasks can be scheduled at the same time.

>> is it possible to change the data exchange mode of two regions dynamically?
This is not in the scope of this FLIP. But we are moving towards a more extensible scheduler (FLINK-10429) and resource-aware scheduling (FLINK-10407), so I think it is possible that we can have a scheduler in the future which dynamically changes the shuffle type based on available resources.

Thanks,
Zhu Zhu
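As a side note to the "make all edges PIPELINED" suggestion above: in the DataSet API the data exchange mode can be influenced via the execution mode. A small sketch, assuming the DataSet API available around Flink 1.10; for a simple chained map there is no network exchange, the snippet only shows where the mode is set.

import org.apache.flink.api.common.ExecutionMode;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;

public class PipelinedExchangeExample {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Prefer pipelined data exchanges so that upstream and downstream tasks
        // can run at the same time (at the cost of needing slots for both regions).
        env.getConfig().setExecutionMode(ExecutionMode.PIPELINED);

        env.fromElements(1, 2, 3, 4)
                .map(new MapFunction<Integer, Integer>() {
                    @Override
                    public Integer map(Integer value) {
                        return value * value;
                    }
                })
                .print();
    }
}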
On Fri, Mar 27, 2020 at 4:49 PM, Yangze Guo <karma...@gmail.com> wrote:

Thanks for updating!

+1 for supporting pipelined region scheduling. Although we cannot prevent resource deadlocks in all scenarios, it is really a big step.

The design generally LGTM.

One minor thing I want to make sure of: if I understand correctly, the blocking edge will not be consumable before the upstream is finished. Without that, when a failure occurs in the upstream region, it is still possible to have a resource deadlock. I don't know whether it is an explicit protocol now, but after this FLIP I think it should not be broken.
I'm also wondering whether we could execute the upstream and downstream regions at the same time if we have enough resources. It could shorten the running time of large jobs. We should not break the protocol of the blocking edge, but is it possible to change the data exchange mode of two regions dynamically?

Best,
Yangze Guo

On Fri, Mar 27, 2020 at 1:15 PM, Zhu Zhu <reed...@gmail.com> wrote:

Thanks for reporting this, Yangze.
I have updated the permissions on those images. Everyone is able to view them now.

Thanks,
Zhu Zhu

On Fri, Mar 27, 2020 at 11:25 AM, Yangze Guo <karma...@gmail.com> wrote:

Thanks for driving this discussion, Zhu Zhu & Gary.

I found that the image links in this FLIP are not working well. When I open a link, Google Docs tells me that I have no access privilege. Could you take a look at that issue?
Best,
Yangze Guo

On Fri, Mar 27, 2020 at 1:38 AM, Gary Yao <g...@apache.org> wrote:

Hi community,

In the past releases, we have been working on refactoring Flink's scheduler with the goal of making the scheduler extensible [1]. We have rolled out most of the intended refactoring in Flink 1.10, and we think it is now time to leverage our newly introduced abstractions to implement a new resource-optimized scheduling strategy: Pipelined Region Scheduling.

This scheduling strategy aims at:

* avoidance of resource deadlocks when running batch jobs
* tunability with respect to resource consumption and throughput

More details can be found in the Wiki [2]. We are looking forward to your feedback.

Best,

Zhu Zhu & Gary

[1] https://issues.apache.org/jira/browse/FLINK-10429

[2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-119+Pipelined+Region+Scheduling