Re: Reworking the Rescale API

Maximilian Michels Mon, 06 Feb 2023 07:23:13 -0800

>> I fully agree that in-place scaling is a much harder problem which is out of 
>> the scope for now. My primary concern here is to be able to rescale with 
>> upfront reservation of resources before restarting the job, so the job 
>> doesn't get stuck in case of resource constraints.
> Not sure I follow. The AS only rescales when it has already acquired the 
> slots that it needs.


I'm saying that the primary objective of this thread is to figure out
upfront reservation of resources as part of a new Rescale API. The
adaptive scheduler is a (very reasonable) means to an end to fulfil
this property. If we were to go with another solution because the
adaptive scheduler does not prove to be production ready, then we
would still have to make sure that property holds. I'm going to experiment
a bit with the adaptive scheduler to see if there are any other
limitations.
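
To make the "upfront reservation" property concrete, here is a minimal
sketch. All names (ResourcePool, try_reserve, rescale) are hypothetical
and are not Flink APIs; the sketch only assumes that the extra slots can
be reserved atomically before the job is restarted.

```python
from dataclasses import dataclass


@dataclass
class ResourcePool:
    """Hypothetical stand-in for a cluster's free slots (not a Flink class)."""

    free_slots: int
    reserved_slots: int = 0

    def try_reserve(self, slots: int) -> bool:
        """Reserve all requested slots, or none if too few are free."""
        if slots > self.free_slots:
            return False
        self.free_slots -= slots
        self.reserved_slots += slots
        return True


def rescale(pool: ResourcePool, current_slots: int, target_slots: int) -> str:
    """Allow the restart only after the additional slots are reserved."""
    extra = max(0, target_slots - current_slots)
    if not pool.try_reserve(extra):
        # Not enough resources: the job keeps running at its old parallelism.
        return "rejected"
    # Only at this point would the job be restarted with the new parallelism.
    return "rescaled"
```

The point of the sketch is the ordering: a rejected request leaves the
running job untouched, so it cannot get stuck waiting for resources.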

As for the slot sharing groups with different maximum parallelism, I
see what the issue is here:
https://github.com/apache/flink/blob/2ae5df278958073fee63b2bf824a53a28a21701b/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/allocator/SlotSharingSlotAllocator.java#L97
Should be fixable. I've filed a JIRA here:
https://issues.apache.org/jira/browse/FLINK-30931
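
A minimal sketch of how slots can end up unused. It assumes that the
allocator splits the free slots evenly across slot sharing groups and
then caps each group at its maximum parallelism; that is one reading of
the linked line, not a confirmed description, and the function and
parameter names are hypothetical.

```python
def assign_parallelism(free_slots, max_parallelism_per_group):
    """Split free slots evenly across groups, then cap each group.

    Returns (parallelism per group, number of unused slots), or None if
    there are not enough slots to give every group at least one.
    """
    slots_per_group = free_slots // len(max_parallelism_per_group)
    if slots_per_group == 0:
        return None
    assigned = {
        group: min(slots_per_group, max_parallelism)
        for group, max_parallelism in max_parallelism_per_group.items()
    }
    # Slots a capped group cannot use are not handed to the other groups.
    unused = free_slots - sum(assigned.values())
    return assigned, unused
```

Under that assumption, 10 free slots and two groups with maximum
parallelism 2 and 8 yield parallelism 2 and 5, leaving 3 slots unused
even though the second group could have taken them.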

-Max

On Fri, Feb 3, 2023 at 10:13 AM Chesnay Schepler <ches...@apache.org> wrote:
>
> > My primary concern here is to be able to rescale with upfront reservation 
> > of resources before restarting the job, so the job doesn't get stuck in 
> > case of resource constraints.
>
> Not sure I follow. The AS only rescales when it has already acquired the 
> slots that it needs.
>
>  > This is a blocker from my side. Why do we have that restriction?
>
> We just didn't bother fixing it initially. It should be easy to fix.
>
> On 02/02/2023 18:29, Maximilian Michels wrote:
> > I fully agree that in-place scaling is a much harder problem which is
> > out of the scope for now. My primary concern here is to be able to
> > rescale with upfront reservation of resources before restarting the
> > job, so the job doesn't get stuck in case of resource constraints.
> >
> >> Unused slots: If the max parallelism for slot sharing groups is not equal, 
> >> slots offered to Adaptive Scheduler might be unused.
> > From: 
> > https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/elastic_scaling/#limitations-1
> >
> > This is a blocker from my side. Why do we have that restriction?
> >
> > On Thu, Feb 2, 2023 at 5:03 PM Chesnay Schepler <ches...@apache.org> wrote:
> >>   > If I understand correctly, the adaptive scheduler currently does a
> >> full job restart. Is there any work planned to enable in-place rescaling
> >> with the adaptive scheduler?
> >>
> >> Nothing concrete. Sure, it's on a wishlist, but it'd require significant
> >> changes to how the runtime works.
> >> Rescaling stateful operators requires keygroups to be redistributed,
> >> you'd need to be able to change task edges dynamically, roll-back to a
> >> checkpoint without restarting tasks, ...
> >>
> >> It's less of a scheduler thing actually.
> >>
> >> An earlier step to that would be to allow recovery from an error without
> >> restarting all tasks, which would benefit all schedulers.
> >> But again bit of a moonshot.
> >>
> >>   > How well has the adaptive scheduler been tested in production? If we
> >> are intending to use it for rescale operations, I'm a bit concerned
> >> those jobs might show different behavior due to the scheduling than jobs
> >> started with the default scheduler.
> >>
> >> I don't think we got a lot of feedback so far.
> >> Outside of the limitations listed on the elastic scaling page (which I
> >> believe we'll address in due time) I'm not aware of any problems.
> >> We haven't run into any issues internally.
> >>
> >> On 02/02/2023 12:44, Maximilian Michels wrote:
> >>> +1 on improving the scheduler docs.
> >>>
> >>>> They never shared a base class since day 1. Are you maybe mixing up the 
> >>>> AdaptiveScheduler and AdaptiveBatchScheduler?
> >>> @Chesnay: Indeed, I had mixed this up. DefaultScheduler and
> >>> AdaptiveScheduler only share the SchedulerNG interface while the
> >>> DefaultScheduler and the AdaptiveBatchScheduler share a subset of the
> >>> code. Too many schedulers :)
> >>>
> >>> Thanks for clarifying the current and the intended feature set of the
> >>> adaptive scheduler!
> >>>
> >>> How well has the adaptive scheduler been tested in production? If we
> >>> are intending to use it for rescale operations, I'm a bit concerned
> >>> those jobs might show different behavior due to the scheduling than
> >>> jobs started with the default scheduler.
> >>>
> >>> If I understand correctly, the adaptive scheduler currently does a
> >>> full job restart. Is there any work planned to enable in-place
> >>> rescaling with the adaptive scheduler?
> >>>
> >>>> @max:
> >>>>     - When the user repartitions, we still need to restart the job. Can we
> >>>>     try to do this part of the work internally instead of externally and, as
> >>>>     *@konstantin* said, only trigger rescaling once a checkpoint or retained
> >>>>     checkpoint has completed, to minimize reprocessing?
> >>> @ConradJam: I'm not sure I understand your question. Do you mean when
> >>> the partition strategy changes between operators? That shouldn't be
> >>> the case for Rescale (except maybe converting ForwardPartitioner to
> >>> RescalePartitioner). A more advanced rescale API could allow user
> >>> control over this but for now I think it would only support adjusting
> >>> parallelism of vertices.
> >>>
> >>> -Max
> >>>
> >>> On Thu, Feb 2, 2023 at 6:44 AM weijie guo <guoweijieres...@gmail.com> wrote:
> >>>> Hi David,
> >>>>
> >>>> Sorry I'm late to join the discussion.
> >>>>
> >>>> +1 for having a more structured doc about the scheduler ecosystem, and I can
> >>>> help to fill in the details about the batch part.
> >>>>
> >>>> Best regards,
> >>>>
> >>>> Weijie
> >>>>
> >>>>
> >>>>
> >>>> David Morávek <d...@apache.org> wrote on Wed, Feb 1, 2023 at 22:38:
> >>>>> It makes sense to give the whole "scheduler ecosystem," not just the
> >>>>> adaptive scheduler, a little bit more structure in the docs. We already
> >>>>> have 4 different schedulers (Default, Adaptive, AdaptiveBatch,
> >>>>> AdaptiveBatchSpeculative), and it becomes quite confusing since the details
> >>>>> are scattered around the docs. Maybe having a "Job Schedulers" subpage, the
> >>>>> same way as we have for "Resource Providers", could do the trick.
> >>>>>
> >>>>> I should be able to fill in the details about the streaming ones, but I
> >>>>> will probably need some help with the batch ones.
> >>>>>
> >>>>> As for the first FLIP, it's already prepared and we should be able to
> >>>>> publish it by Friday.
> >>>>>
> >>>>> Best,
> >>>>> D.
> >>>>>
> >>>>>
> >>>>> On Wed, Feb 1, 2023 at 9:56 AM Gyula Fóra <gyula.f...@gmail.com> wrote:
> >>>>>
> >>>>>> Chesnay, David:
> >>>>>>
> >>>>>> Thank you guys for the extra information. We were clearly missing some
> >>>>>> context here around the scheduler related efforts and the currently
> >>>>>> available feature set.
> >>>>>>
> >>>>>> As for the concrete suggestions regarding the docs.
> >>>>>>
> >>>>>> 1. If the adaptive scheduler provides a significantly different feature set
> >>>>>> from the default scheduler we could have its own smaller doc page detailing
> >>>>>> the differences and why people should switch to it for streaming. This will
> >>>>>> also help us when we are making the transition and change the default
> >>>>>> behaviour.
> >>>>>> 2. We could still have an elastic scaling page that links to the adaptive
> >>>>>> scheduler (and vice versa) that focuses on elastic scaling + the Kubernetes
> >>>>>> operator autoscaler for a complete picture on elastic scaling options +
> >>>>>> detailing the limitations of the different approaches.
> >>>>>>
> >>>>>> This way the Adaptive Scheduler docs will be decoupled from elastic scaling
> >>>>>> and will result in a better understanding for the users (it sure would have
> >>>>>> helped us here, and we are on the more advanced user side :))
> >>>>>>
> >>>>>> What do you think?
> >>>>>> Gyula
> >>>>>>
> >>>>>> On Sat, Jan 28, 2023 at 4:20 AM ConradJam <jam.gz...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Sorry I'm late to join the discussion. I've gleaned a lot of useful
> >>>>>>> information from you guys.
> >>>>>>>
> >>>>>>> *@max*
> >>>>>>>
> >>>>>>>      - When the user repartitions, we still need to restart the job. Can
> >>>>>>>      we try to do this part of the work internally instead of externally
> >>>>>>>      and, as *@konstantin* said, only trigger rescaling once a checkpoint
> >>>>>>>      or retained checkpoint has completed, to minimize reprocessing?
> >>>>>>>
> >>>>>>> *@konstantin*
> >>>>>>>
> >>>>>>>      - I think you mentioned that 2 FLIPs are being drafted, which I
> >>>>>>>      consider a precondition for achieving *@max*'s goal. I would love to
> >>>>>>>      join this discussion and contribute to it. I've tried a native
> >>>>>>>      implementation of this part myself; if it can help the community,
> >>>>>>>      that's the best I can do.
> >>>>>>>
> >>>>>>> *@chesnay*
> >>>>>>>
> >>>>>>>      - The docs section is confusing and leads to misconceptions, as
> >>>>>>>      *@gyula* says. I'll see if I can fix it.
> >>>>>>>
> >>>>>>>
> >>>>>>> *About Rescale API*
> >>>>>>>
> >>>>>>>     Some limitations and differences between *default* and *reactive mode*
> >>>>>>> were discussed earlier, and *@chesnay* explained some of their limitations
> >>>>>>> and behaviors; essentially they are two different things. I agree that when
> >>>>>>> reactive mode is ready, it should be used as the *reactive mode* for the
> >>>>>>> default *stream processing* job.
> >>>>>>>     As for the *[1] Rescale API*, as we know now, it seems to be unusable. I
> >>>>>>> believe the goal of this API is to be able to change parallelism quickly. I
> >>>>>>> would like to wait until the discussion is over and the 2 draft FLIPs
> >>>>>>> mentioned earlier are completed. It will not be too late then to decide
> >>>>>>> whether to modify the *[2] Rescale REST API* to support modifying the
> >>>>>>> parallelism of job vertices.
> >>>>>>>
> >>>>>>>
> >>>>>>>      1. https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/elastic_scaling/
> >>>>>>>      2. https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobs-jobid-rescaling
> >>>>>>>
> >>>>>>>
> >>>>>>> Best～
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Maximilian Michels <m...@apache.org> wrote on Tue, Jan 24, 2023 at 01:08:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> The current rescale API appears to be a work in progress. A couple years
> >>>>>>>> ago, we disabled access to the API [1].
> >>>>>>>>
> >>>>>>>> I'm looking into this problem as part of working on autoscaling [2] where
> >>>>>>>> we currently require a full restart of the job to apply the parallelism
> >>>>>>>> overrides. This adds additional delay and comes with the caveat that we
> >>>>>>>> don't know whether sufficient resources are available prior to executing
> >>>>>>>> the scaling decision. We obviously do not want to get stuck due to a lack
> >>>>>>>> of resources. So a rescale API would have to ensure enough resources are
> >>>>>>>> available prior to restarting the job.
> >>>>>>>>
> >>>>>>>> I've created an issue here:
> >>>>>>>> https://issues.apache.org/jira/browse/FLINK-30773
> >>>>>>>>
> >>>>>>>> Any comments or interest in working on this?
> >>>>>>>>
> >>>>>>>> -Max
> >>>>>>>>
> >>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-12312
> >>>>>>>> [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
> >>>>>>> --
> >>>>>>> Best
> >>>>>>>
> >>>>>>> ConradJam
> >>>>>>>
>
