@ZhuZhu The links work for me now. Someone might have fixed them. Never mind.
Thank you~

Xintong Song

On Mon, Mar 30, 2020 at 1:31 AM Zhu Zhu <reed...@gmail.com> wrote:

> Thanks for the comments!
>
> To Xintong,
> It's a bit strange, since the in-page links work as expected for me. Would you take another try?
>
> To Till,
> - Regarding the idea to improve the SlotProvider interface
> I think it is a good idea, thanks a lot! In the current design we make slot requests for batch jobs wait for resources without a timeout as long as the JM sees enough slots overall. This implicitly adds the assumption that tasks can finish and slots will be returned. This, however, would not work for the mixed bounded/unbounded workloads you mentioned.
> Your idea is clearer: it always allows slot allocations to wait without timing out as long as the provider sees enough slots, and the 'enough' check distinguishes slots that can be returned (held by bounded tasks) from slots that will be occupied forever (held by unbounded tasks). That way streaming jobs naturally throw slot allocation timeout errors if the cluster does not have enough resources for all the tasks to run at the same time.
> I will take a deeper look at how we can implement it this way.
>
> - Regarding the idea to solve "Resource deadlocks when slot allocation competition happens between multiple jobs in a session cluster"
> Agreed, it's also possible to let the RM revoke slots to unblock the oldest bulk of requests first. That would require some extra work to change the RM to hold the requests until it is sure the slots have been successfully assigned to the JM (currently the RM removes pending requests right after the requests are sent to the TM, without confirming whether the slot offers succeed). We can look deeper into it later when we are about to support slots of varying sizes.
>
> Thanks,
> Zhu Zhu
>
>
> Till Rohrmann <trohrm...@apache.org> 于2020年3月27日周五 下午10:59写道:
>
> > Thanks for creating this FLIP Zhu Zhu and Gary!
> >
> > +1 for adding pipelined region scheduling.
> >
> > Concerning the extended SlotProvider interface I have an idea how we could further improve it. If I am not mistaken, you have proposed to introduce the two timeouts in order to distinguish between batch and streaming jobs, and to encode that batch job requests can wait if there are enough resources in the SlotPool (not necessarily available right now). I think what we actually need to tell the SlotProvider is whether a request will use the slot only for a limited time or not. This is exactly the difference between processing bounded and unbounded streams. If the SlotProvider knows this difference, then it can tell which slots will eventually be reusable and which not. Based on this it can tell whether a slot request can be fulfilled eventually or whether we should fail after the specified timeout. Another benefit of this approach would be that we can easily support mixed bounded/unbounded workloads. What we would need to know for this approach is whether a pipelined region is processing a bounded or an unbounded stream.
> >
> > To give an example, let's assume we request the following sets of slots, where each pipelined region requires the same slots:
> >
> > slotProvider.allocateSlots(pr1_bounded, timeout);
> > slotProvider.allocateSlots(pr2_unbounded, timeout);
> > slotProvider.allocateSlots(pr3_bounded, timeout);
> >
> > Let's assume we receive slots for pr1_bounded in < timeout and can then fulfill the request. Then we request pr2_unbounded. Since we know that pr1_bounded will complete eventually, we don't fail this request after the timeout. Next we request pr3_bounded after pr2_unbounded has been fulfilled. In this case, we see that we need to request new resources because pr2_unbounded won't release its slots. Hence, if we cannot allocate new resources within the timeout, we fail this request.
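A minimal Java sketch of the interface shape Till describes above. All names here (RegionSlotProvider, StreamKind, and the placeholder types) are hypothetical and are not Flink's actual SlotProvider API:

    import java.util.Collection;
    import java.util.concurrent.CompletableFuture;

    // Hypothetical sketch: the caller declares whether a pipelined region
    // processes a bounded stream (its slots are returned once the region
    // finishes) or an unbounded one (its slots stay occupied forever).
    interface RegionSlotProvider {

        enum StreamKind { BOUNDED, UNBOUNDED }

        // Placeholder types standing in for the scheduler's real classes.
        interface LogicalSlot {}
        interface PipelinedRegionRequest {}

        // The request may keep waiting past the timeout as long as it can be
        // fulfilled eventually, i.e. out of free slots plus slots currently
        // held by BOUNDED regions. If fulfillment depends on slots held by
        // UNBOUNDED regions, the request fails after the timeout.
        CompletableFuture<Collection<LogicalSlot>> allocateSlots(
                PipelinedRegionRequest region,
                StreamKind streamKind,
                long timeoutMillis);
    }

Against this sketch, the example sequence above would play out as follows (assuming provider, pr1..pr3, and timeout are in scope):

    provider.allocateSlots(pr1, StreamKind.BOUNDED, timeout);   // fulfilled in < timeout
    provider.allocateSlots(pr2, StreamKind.UNBOUNDED, timeout); // may outwait the timeout: pr1 will free its slots
    provider.allocateSlots(pr3, StreamKind.BOUNDED, timeout);   // fails after timeout: pr2 never frees its slots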
> > A small comment concerning "Resource deadlocks when slot allocation competition happens between multiple jobs in a session cluster": another idea to solve this situation would be to give the ResourceManager the right to revoke slot assignments in order to change the mapping between requests and available slots.
> >
> > Cheers,
> > Till
> >
> > On Fri, Mar 27, 2020 at 12:44 PM Xintong Song <tonysong...@gmail.com> wrote:
> >
> > > Gary & Zhu Zhu,
> > >
> > > Thanks for preparing this FLIP, and a BIG +1 from my side. The trade-off between resource utilization and potential deadlock problems has always been a pain. Despite not solving all the deadlock cases, this FLIP is definitely a big improvement. IIUC, it already covers all the existing single-job cases, and all the mentioned non-covered cases are either in multi-job session clusters or involve diverse slot resources in the future.
> > >
> > > I've read through the FLIP, and it looks really good to me. Good job! All the concerns and limitations that I can think of have already been clearly stated, with reasonable potential future solutions. From the perspective of fine-grained resource management, I do not see any serious/irresolvable conflict at this time.
> > >
> > > nit: The in-page links are not working. I guess those were copied from Google Docs directly?
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > > On Fri, Mar 27, 2020 at 6:26 PM Zhu Zhu <reed...@gmail.com> wrote:
> > >
> > > > To Yangze,
> > > >
> > > > >> the blocking edge will not be consumable before the upstream is finished.
> > > > Yes. This is how we define a BLOCKING result partition: "Blocking partitions represent blocking data exchanges, where the data stream is first fully produced and then consumed".
> > > >
> > > > >> I'm also wondering could we execute the upstream and downstream regions at the same time if we have enough resources
> > > > It may lead to resource waste, since the tasks in downstream regions cannot read any data before the upstream region finishes. It saves a bit of time on scheduling, but usually that does not make much difference for large jobs, since data processing takes much more time. For small jobs, one can make all edges PIPELINED so that all the tasks can be scheduled at the same time.
> > > >
> > > > >> is it possible to change the data exchange mode of two regions dynamically?
> > > > This is not in the scope of this FLIP. But we are moving towards a more extensible scheduler (FLINK-10429) and resource-aware scheduling (FLINK-10407), so I think it's possible we can have a scheduler in the future which dynamically changes the shuffle type based on the available resources.
> > > >
> > > > Thanks,
> > > > Zhu Zhu
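As a concrete illustration of switching all edges between PIPELINED and BLOCKING, here is a hedged sketch against the DataSet API's ExecutionMode as it existed around Flink 1.10; the sample job itself is made up:

    import org.apache.flink.api.common.ExecutionMode;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class ExchangeModeExample {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Force every data exchange to be pipelined, so the whole job forms
            // one region and all tasks are scheduled at the same time. Using
            // ExecutionMode.BATCH_FORCED instead makes every exchange BLOCKING,
            // letting each region be scheduled, and its slots reused, separately.
            env.getConfig().setExecutionMode(ExecutionMode.PIPELINED_FORCED);

            DataSet<Integer> source = env.fromElements(1, 2, 3);
            source.rebalance() // an all-to-all exchange whose type follows the mode
                  .map(new MapFunction<Integer, Integer>() {
                      @Override
                      public Integer map(Integer value) {
                          return value * 2;
                      }
                  })
                  .print();
        }
    }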
Although we > could > > > > > not prevent resource deadlock in all scenarios, it is really a big > > > > > step. > > > > > > > > > > The design generally LGTM. > > > > > > > > > > One minor thing I want to make sure. If I understand correctly, the > > > > > blocking edge will not be consumable before the upstream is > finished. > > > > > Without it, when the failure occurs in the upstream region, there > is > > > > > still possible to have a resource deadlock. I don't know whether it > > is > > > > > an explicit protocol now. But after this FLIP, I think it should > not > > > > > be broken. > > > > > I'm also wondering could we execute the upstream and downstream > > > > > regions at the same time if we have enough resources. It can > shorten > > > > > the running time of large job. We should not break the protocol of > > > > > blocking edge. But if it is possible to change the data exchange > mode > > > > > of two regions dynamically? > > > > > > > > > > Best, > > > > > Yangze Guo > > > > > > > > > > On Fri, Mar 27, 2020 at 1:15 PM Zhu Zhu <reed...@gmail.com> wrote: > > > > > > > > > > > > Thanks for reporting this Yangze. > > > > > > I have update the permission to those images. Everyone are able > to > > > view > > > > > them now. > > > > > > > > > > > > Thanks, > > > > > > Zhu Zhu > > > > > > > > > > > > Yangze Guo <karma...@gmail.com> 于2020年3月27日周五 上午11:25写道: > > > > > >> > > > > > >> Thanks for driving this discussion, Zhu Zhu & Gary. > > > > > >> > > > > > >> I found that the image link in this FLIP is not working well. > > When I > > > > > >> open that link, Google doc told me that I have no access > > privilege. > > > > > >> Could you take a look at that issue? > > > > > >> > > > > > >> Best, > > > > > >> Yangze Guo > > > > > >> > > > > > >> On Fri, Mar 27, 2020 at 1:38 AM Gary Yao <g...@apache.org> > wrote: > > > > > >> > > > > > > >> > Hi community, > > > > > >> > > > > > > >> > In the past releases, we have been working on refactoring > > Flink's > > > > > scheduler > > > > > >> > with the goal of making the scheduler extensible [1]. We have > > > rolled > > > > > out > > > > > >> > most of the intended refactoring in Flink 1.10, and we think > it > > is > > > > > now time > > > > > >> > to leverage our newly introduced abstractions to implement a > new > > > > > resource > > > > > >> > optimized scheduling strategy: Pipelined Region Scheduling. > > > > > >> > > > > > > >> > This scheduling strategy aims at: > > > > > >> > > > > > > >> > * avoidance of resource deadlocks when running batch jobs > > > > > >> > > > > > > >> > * tunable with respect to resource consumption and > > throughput > > > > > >> > > > > > > >> > More details can be found in the Wiki [2]. We are looking > > forward > > > to > > > > > your > > > > > >> > feedback. > > > > > >> > > > > > > >> > Best, > > > > > >> > > > > > > >> > Zhu Zhu & Gary > > > > > >> > > > > > > >> > [1] https://issues.apache.org/jira/browse/FLINK-10429 > > > > > >> > > > > > > >> > [2] > > > > > >> > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-119+Pipelined+Region+Scheduling > > > > > > > > > > > > > > >