Thanks for your clarifications, David. I don't have any additional major points to add. One thing about the FLIP: the RPC layer API for updating the JRR returns a future that carries a JRR. I don't see the value in returning a JRR here, since the operation is idempotent. Wouldn't it be enough to return CompletableFuture<Void>? Or am I missing something?
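For illustration, a minimal Java sketch of the two return-type options I have in mind; the names (JobResourceRequirementsGateway, VertexParallelismRange) and the map-based payload are made up for this example and are not taken from the FLIP or the Flink code base:

import java.util.Map;
import java.util.concurrent.CompletableFuture;

/**
 * Hypothetical sketch of the RPC-level contract under discussion; all names
 * here are illustrative, not the actual FLIP or Flink API.
 */
public interface JobResourceRequirementsGateway {

    /** Illustrative stand-in for the per-vertex lower/upper parallelism bounds. */
    final class VertexParallelismRange {
        final int lowerBound;
        final int upperBound;

        VertexParallelismRange(int lowerBound, int upperBound) {
            this.lowerBound = lowerBound;
            this.upperBound = upperBound;
        }
    }

    /**
     * Variant as I read it in the FLIP: the caller gets the resulting
     * requirements back, even though an idempotent update already knows
     * exactly what it sent.
     */
    CompletableFuture<Map<String, VertexParallelismRange>> updateJobResourceRequirements(
            Map<String, VertexParallelismRange> requirementsPerVertex);

    /**
     * Acknowledge-only variant suggested above: because the update is
     * idempotent, successful completion of the future is all the caller needs.
     */
    CompletableFuture<Void> declareJobResourceRequirements(
            Map<String, VertexParallelismRange> requirementsPerVertex);
}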
Matthias On Mon, Feb 20, 2023 at 1:48 PM Maximilian Michels <m...@apache.org> wrote: > Thanks David! If we could get the pre-allocation working as part of > the FLIP, that would be great. > > Concerning the downscale case, I agree this is a special case for the > (single-job) application mode where we could re-allocate slots in a > way that could leave entire task managers unoccupied which we would > then be able to release. The goal essentially is to reduce slot > fragmentation on scale down by packing the slots efficiently. The > easiest way to add this optimization when running in application mode > would be to drop as many task managers during the restart such that > NUM_REQUIRED_SLOTS >= NUM_AVAILABLE_SLOTS stays true. We can look into > this independently of the FLIP. > > Feel free to start the vote. > > -Max > > On Mon, Feb 20, 2023 at 9:10 AM David Morávek <d...@apache.org> wrote: > > > > Hi everyone, > > > > Thanks for the feedback! I've updated the FLIP to use idempotent PUT API > instead of PATCH and to properly handle lower bound settings, to support > the "pre-allocation" of the resources. > > > > @Max > > > > > How hard would it be to address this issue in the FLIP? > > > > I've included this in the FLIP. It might not be too hard to implement > this in the end. > > > > > B) drop as many superfluous task managers as needed > > > > I've intentionally left this part out for now because this ultimately > needs to be the responsibility of the Resource Manager. After all, in the > Session Cluster scenario, the Scheduler doesn't have the bigger picture of > other tasks of other jobs running on those TMs. This will most likely be a > topic for another FLIP. > > > > WDYT? If there are no other questions or concerns, I'd like to start the > vote on Wednesday. > > > > Best, > > D. > > > > On Wed, Feb 15, 2023 at 3:34 PM Maximilian Michels <m...@apache.org> > wrote: > >> > >> I missed that the FLIP states: > >> > >> > Currently, even though we’d expose the lower bound for clarity and > API completeness, we won’t allow setting it to any other value than one > until we have full support throughout the stack. > >> > >> How hard would it be to address this issue in the FLIP? > >> > >> There is not much value to offer setting a lower bound which won't be > >> respected / throw an error when it is set. If we had support for a > >> lower bound, we could enforce a resource contract externally via > >> setting lowerBound == upperBound. That ties back to the Rescale API > >> discussion we had. I want to better understand what the major concerns > >> would be around allowing this. > >> > >> Just to outline how I imagine the logic to work: > >> > >> A) The resource constraints are already met => Nothing changes > >> B) More resources available than required => Cancel the job, drop as > >> many superfluous task managers as needed, restart the job > >> C) Less resources available than required => Acquire new task > >> managers, wait for them to register, cancel and restart the job > >> > >> I'm open to helping out with the implementation. > >> > >> -Max > >> > >> On Mon, Feb 13, 2023 at 7:45 PM Maximilian Michels <m...@apache.org> > wrote: > >> > > >> > Based on further discussion I had with Chesnay on this PR [1], I think > >> > jobs would currently go into a restarting state after the resource > >> > requirements have changed. This wouldn't achieve what we had in mind, > >> > i.e. 
sticking to the old resource requirements until enough slots are > >> > available to fulfil the new resource requirements. So this may not be > >> > 100% what we need but it could be extended to do what we want. > >> > > >> > -Max > >> > > >> > [1] https://github.com/apache/flink/pull/21908#discussion_r1104792362 > >> > > >> > On Mon, Feb 13, 2023 at 7:16 PM Maximilian Michels <m...@apache.org> > wrote: > >> > > > >> > > Hi David, > >> > > > >> > > This is awesome! Great writeup and demo. This is pretty much what we > >> > > need for the autoscaler as part of the Flink Kubernetes operator > [1]. > >> > > Scaling Flink jobs effectively is hard but fortunately we have > solved > >> > > the issue as part of the Flink Kubernetes operator. The only > critical > >> > > piece we are missing is a better way to execute scaling decisions, > as > >> > > discussed in [2]. > >> > > > >> > > Looking at your proposal, we would set lowerBound == upperBound for > >> > > the parallelism because we want to fully determine the parallelism > >> > > externally based on the scaling metrics. Does that sound right? > >> > > > >> > > What is the timeline for these changes? Is there a JIRA? > >> > > > >> > > Cheers, > >> > > Max > >> > > > >> > > [1] > https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/ > >> > > [2] > https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5 > >> > > > >> > > On Mon, Feb 13, 2023 at 1:16 PM feng xiangyu <xiangyu...@gmail.com> > wrote: > >> > > > > >> > > > Hi David, > >> > > > > >> > > > Thanks for your reply. I think your response totally make > sense. This > >> > > > flip targets on declaring required resource to ResourceManager > instead of > >> > > > using ResourceManager to add/remove TMs directly. > >> > > > > >> > > > Best, > >> > > > Xiangyu > >> > > > > >> > > > > >> > > > > >> > > > David Morávek <david.mora...@gmail.com> 于2023年2月13日周一 15:46写道: > >> > > > > >> > > > > Hi everyone, > >> > > > > > >> > > > > @Shammon > >> > > > > > >> > > > > I'm not entirely sure what "config file" you're referring to. > You can, of > >> > > > > course, override the default parallelism in "flink-conf.yaml", > but for > >> > > > > sinks and sources, the parallelism needs to be tweaked on the > connector > >> > > > > level ("WITH" statement). > >> > > > > > >> > > > > This is something that should be achieved with tooling around > Flink. We > >> > > > > want to provide an API on the lowest level that generalizes > well. Achieving > >> > > > > what you're describing should be straightforward with this API. > >> > > > > > >> > > > > @Xiangyu > >> > > > > > >> > > > > Is it possible for this REST API to declare TM resources in the > future? > >> > > > > > >> > > > > > >> > > > > Would you like to add/remove TMs if you use an active Resource > Manager? > >> > > > > This would be out of the scope of this effort since it targets > the > >> > > > > scheduler component only (we make no assumptions about the used > Resource > >> > > > > Manager). Also, the AdaptiveScheduler is only intended to be > used for > >> > > > > Streaming. > >> > > > > > >> > > > > And for streaming jobs, I'm wondering if there is any > situation we need to > >> > > > > > rescale the TM resources of a flink cluster at first and then > the > >> > > > > adaptive > >> > > > > > scheduler will rescale the per-vertex ResourceProfiles > accordingly. 
> >> > > > > > > >> > > > > > >> > > > > We plan on adding support for the ResourceProfiles (dynamic slot > >> > > > > allocation) as the next step. Again we won't make any > assumptions about the > >> > > > > used Resource Manager. In other words, this effort ends by > declaring > >> > > > > desired resources to the Resource Manager. > >> > > > > > >> > > > > Does that make sense? > >> > > > > > >> > > > > @Matthias > >> > > > > > >> > > > > We've done another pass on the proposed API and currently lean > towards > >> > > > > having an idempotent PUT API. > >> > > > > - We don't care too much about multiple writers' scenarios in > terms of who > >> > > > > can write an authoritative payload; this is up to the user of > the API to > >> > > > > figure out > >> > > > > - It's indeed tricky to achieve atomicity with PATCH API; > switching to PUT > >> > > > > API seems to do the trick > >> > > > > - We won't allow partial "payloads" anymore, meaning you need > to define > >> > > > > requirements for all vertices in the JobGraph; This is > completely fine for > >> > > > > the programmatic workflows. For DEBUG / DEMO purposes, you can > use the GET > >> > > > > endpoint and tweak the response to avoid writing the whole > payload by hand. > >> > > > > > >> > > > > WDYT? > >> > > > > > >> > > > > > >> > > > > Best, > >> > > > > D. > >> > > > > > >> > > > > On Fri, Feb 10, 2023 at 11:21 AM feng xiangyu < > xiangyu...@gmail.com> > >> > > > > wrote: > >> > > > > > >> > > > > > Hi David, > >> > > > > > > >> > > > > > Thanks for creating this flip. I think this work it is very > useful, > >> > > > > > especially in autoscaling scenario. I would like to share > some questions > >> > > > > > from my view. > >> > > > > > > >> > > > > > 1, Is it possible for this REST API to declare TM resources > in the > >> > > > > future? > >> > > > > > I'm asking because we are building the autoscaling feature > for Flink OLAP > >> > > > > > Session Cluster in ByteDance. We need to rescale the > cluster's resource > >> > > > > on > >> > > > > > TM level instead of Job level. It would be very helpful if we > have a REST > >> > > > > > API for out external Autoscaling service to use. > >> > > > > > > >> > > > > > 2, And for streaming jobs, I'm wondering if there is any > situation we > >> > > > > need > >> > > > > > to rescale the TM resources of a flink cluster at first and > then the > >> > > > > > adaptive scheduler will rescale the per-vertex > ResourceProfiles > >> > > > > > accordingly. > >> > > > > > > >> > > > > > best. > >> > > > > > Xiangyu > >> > > > > > > >> > > > > > Shammon FY <zjur...@gmail.com> 于2023年2月9日周四 11:31写道: > >> > > > > > > >> > > > > > > Hi David > >> > > > > > > > >> > > > > > > Thanks for your answer. > >> > > > > > > > >> > > > > > > > Can you elaborate more about how you'd intend to use the > endpoint? I > >> > > > > > > think we can ultimately introduce a way of re-declaring > "per-vertex > >> > > > > > > defaults," but I'd like to understand the use case bit more > first. > >> > > > > > > > >> > > > > > > For this issue, I mainly consider the consistency of user > configuration > >> > > > > > and > >> > > > > > > job runtime. For sql jobs, users usually set specific > parallelism for > >> > > > > > > source and sink, and set a global parallelism for other > operators. > >> > > > > These > >> > > > > > > config items are stored in a config file. For some > high-priority jobs, > >> > > > > > > users may want to manage them manually. > >> > > > > > > 1. 
When users need to scale the parallelism, they should > update the > >> > > > > > config > >> > > > > > > file and restart flink job, which may take a long time. > >> > > > > > > 2. After providing the REST API, users can just send a > request to the > >> > > > > job > >> > > > > > > via REST API quickly after updating the config file. > >> > > > > > > The configuration in the running job and config file should > be the > >> > > > > same. > >> > > > > > > What do you think of this? > >> > > > > > > > >> > > > > > > best. > >> > > > > > > Shammon > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > On Tue, Feb 7, 2023 at 4:51 PM David Morávek < > david.mora...@gmail.com> > >> > > > > > > wrote: > >> > > > > > > > >> > > > > > > > Hi everyone, > >> > > > > > > > > >> > > > > > > > Let's try to answer the questions one by one. > >> > > > > > > > > >> > > > > > > > *@ConradJam* > >> > > > > > > > > >> > > > > > > > when the number of "slots" is insufficient, can we can > stop users > >> > > > > > > rescaling > >> > > > > > > > > or throw something to tell user "less avaliable slots > to upgrade, > >> > > > > > > please > >> > > > > > > > > checkout your alivalbe slots" ? > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > The main property of AdaptiveScheduler is that it can > adapt to > >> > > > > > "available > >> > > > > > > > resources," which means you're still able to make > progress even > >> > > > > though > >> > > > > > > you > >> > > > > > > > didn't get all the slots you've asked for. Let's break > down the pros > >> > > > > > and > >> > > > > > > > cons of this property. > >> > > > > > > > > >> > > > > > > > - (plus) If you lose a TM for some reason, you can still > recover even > >> > > > > > if > >> > > > > > > it > >> > > > > > > > doesn't come back. We still need to give it some time to > eliminate > >> > > > > > > > unnecessary rescaling, which can be controlled by setting > >> > > > > > > > "resource-stabilization-timeout." > >> > > > > > > > - (plus) The resources can arrive with a significant > delay. For > >> > > > > > example, > >> > > > > > > > you're unable to spawn enough TMs on time because you've > run out of > >> > > > > > > > resources in your k8s cluster, and you need to wait for > the cluster > >> > > > > > auto > >> > > > > > > > scaler to kick in and add new nodes to the cluster. In > this scenario, > >> > > > > > > > you'll be able to start making progress faster, at the > cost of > >> > > > > multiple > >> > > > > > > > rescalings (once the remaining resources arrive). > >> > > > > > > > - (plus) This plays well with the declarative manner of > today's > >> > > > > > > > infrastructure. For example, you tell k8s that you need > 10 TMs, and > >> > > > > > > you'll > >> > > > > > > > eventually get them. > >> > > > > > > > - (minus) In the case of large state jobs, the cost of > multiple > >> > > > > > > rescalings > >> > > > > > > > might outweigh the above. > >> > > > > > > > > >> > > > > > > > We've already touched on the solution to this problem on > the FLIP. > >> > > > > > Please > >> > > > > > > > notice the parallelism knob being a range with a lower > and upper > >> > > > > bound. > >> > > > > > > > Setting both the lower and upper bound to the same value > could give > >> > > > > the > >> > > > > > > > behavior you're describing at the cost of giving up some > properties > >> > > > > > that > >> > > > > > > AS > >> > > > > > > > gives you (you'd be falling back to the > DefaultScheduler's behavior). 
> >> > > > > > > > > >> > > > > > > > when user upgrade job-vertx-parallelism . I want to have > an interface > >> > > > > > to > >> > > > > > > > > query the current update parallel execution status, so > that the > >> > > > > user > >> > > > > > or > >> > > > > > > > > program can understand the current status > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > This is a misunderstanding. We're not introducing the > RESCALE > >> > > > > endpoint. > >> > > > > > > > This endpoint allows you to re-declare the resources > needed to run > >> > > > > the > >> > > > > > > job. > >> > > > > > > > Once you reach the desired resources (you get more > resources than the > >> > > > > > > lower > >> > > > > > > > bound defines), your job will run. > >> > > > > > > > > >> > > > > > > > We can expose a similar endpoint to "resource > requirements" to give > >> > > > > you > >> > > > > > > an > >> > > > > > > > overview of the resources the vertices already have. You > can already > >> > > > > > get > >> > > > > > > > this from the REST API, so exposing this in yet another > way should be > >> > > > > > > > considered carefully. > >> > > > > > > > > >> > > > > > > > *@Matthias* > >> > > > > > > > > >> > > > > > > > I'm wondering whether it makes sense to add some kind of > resource ID > >> > > > > to > >> > > > > > > the > >> > > > > > > > > REST API. > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > That's a good question. I want to think about that and > get back to > >> > > > > the > >> > > > > > > > question later. My main struggle when thinking about this > is, "if > >> > > > > this > >> > > > > > > > would be an idempotent POST endpoint," would it be any > different? > >> > > > > > > > > >> > > > > > > > How often do we allow resource requirements to be changed? > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > There shall be no rate limiting on the FLINK side. If > this is > >> > > > > something > >> > > > > > > > your environment needs, you can achieve it on a different > layer ("we > >> > > > > > > can't > >> > > > > > > > have FLINK to do everything"). > >> > > > > > > > > >> > > > > > > > Versioning the JobGraph in the JobGraphStore rather than > overwriting > >> > > > > it > >> > > > > > > > > might be an idea. > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > This sounds interesting since it would be closer to the > JobGraph > >> > > > > being > >> > > > > > > > immutable. The main problem I see here is that this would > introduce a > >> > > > > > > > BW-incompatible change so it might be a topic for > follow-up FLIP. > >> > > > > > > > > >> > > > > > > > I'm just wondering whether we bundle two things together > that are > >> > > > > > > actually > >> > > > > > > > > separate > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > Yup, this is how we think about it as well. The main > question is, > >> > > > > "who > >> > > > > > > > should be responsible for bookkeeping 1) the JobGraph and > 2) the > >> > > > > > > > JobResourceRequirements". The JobMaster would be the > right place for > >> > > > > > > both, > >> > > > > > > > but it's currently not the case, and we're tightly > coupling the > >> > > > > > > dispatcher > >> > > > > > > > with the JobMaster. 
> >> > > > > > > > > >> > > > > > > > Initially, we tried to introduce a separate HA component > in JobMaster > >> > > > > > for > >> > > > > > > > bookkeeping the JobResourceRequirements, but that proved > to be a more > >> > > > > > > > significant effort adding additional mess to the already > messy HA > >> > > > > > > > ecosystem. Another approach we've discussed was mutating > the JobGraph > >> > > > > > and > >> > > > > > > > setting JRR into the JobGraph structure itself. > >> > > > > > > > > >> > > > > > > > The middle ground for keeping this effort reasonably > sized and not > >> > > > > > > > violating "we want to keep JG immutable" too much is > keeping the > >> > > > > > > > JobResourceRequirements separate as an internal config > option in > >> > > > > > > JobGraph's > >> > > > > > > > configuration. > >> > > > > > > > > >> > > > > > > > We ultimately need to rethink the tight coupling of > Dispatcher and > >> > > > > > > > JobMaster, but it needs to be a separate effort. > >> > > > > > > > > >> > > > > > > > ...also considering the amount of data that can be stored > in a > >> > > > > > > > > ConfigMap/ZooKeeper node if versioning the resource > requirement > >> > > > > > change > >> > > > > > > as > >> > > > > > > > > proposed in my previous item is an option for us. > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > AFAIK we're only storing pointers to the S3 objects in HA > metadata, > >> > > > > so > >> > > > > > we > >> > > > > > > > should be okay with having larger structures for now. > >> > > > > > > > > >> > > > > > > > Updating the JobGraphStore means adding more requests to > the HA > >> > > > > backend > >> > > > > > > > API. > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > It's fine unless you intend to override the resource > requirements a > >> > > > > few > >> > > > > > > > times per second. > >> > > > > > > > > >> > > > > > > > *@Shammon* > >> > > > > > > > > >> > > > > > > > How about adding some more information such as vertex type > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > Since it was intended as a "debug" endpoint, it makes > complete sense! > >> > > > > > > > > >> > > > > > > > For sql jobs, we always use a unified parallelism for > most vertices. > >> > > > > > Can > >> > > > > > > > > we provide them with a more convenient setting method > instead of > >> > > > > each > >> > > > > > > > one? > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > I completely feel with this. The main thoughts when > designing the API > >> > > > > > > were: > >> > > > > > > > - We want to keep it clean and easy to understand. > >> > > > > > > > - Global parallelism can be modeled using per-vertex > parallelism but > >> > > > > > not > >> > > > > > > > the other way around. > >> > > > > > > > - The API will be used by external tooling (operator, > auto scaler). > >> > > > > > > > > >> > > > > > > > Can you elaborate more about how you'd intend to use the > endpoint? I > >> > > > > > > think > >> > > > > > > > we can ultimately introduce a way of re-declaring > "per-vertex > >> > > > > > defaults," > >> > > > > > > > but I'd like to understand the use case bit more first. > >> > > > > > > > > >> > > > > > > > *@Weijie* > >> > > > > > > > > >> > > > > > > > What is the default value here (based on what > configuration), or just > >> > > > > > > > > infinite? 
> >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > Currently, for the lower bound, it's always one, and for > the upper > >> > > > > > bound, > >> > > > > > > > it's either parallelism (if defined) or the > maxParallelism of the > >> > > > > > vertex > >> > > > > > > in > >> > > > > > > > JobGraph. This question might be another signal for > making the > >> > > > > defaults > >> > > > > > > > explicit (see the answer to Shammon's question above). > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > Thanks, everyone, for your initial thoughts! > >> > > > > > > > > >> > > > > > > > Best, > >> > > > > > > > D. > >> > > > > > > > > >> > > > > > > > On Tue, Feb 7, 2023 at 4:39 AM weijie guo < > guoweijieres...@gmail.com > >> > > > > > > >> > > > > > > > wrote: > >> > > > > > > > > >> > > > > > > > > Thanks David for driving this. This is a very valuable > work, > >> > > > > > especially > >> > > > > > > > for > >> > > > > > > > > cloud native environment. > >> > > > > > > > > > >> > > > > > > > > >> How about adding some more information such as > vertex type > >> > > > > > > > > (SOURCE/MAP/JOIN and .etc) in the response of `get jobs > >> > > > > > > > > resource-requirements`? For users, only vertex-id may > be difficult > >> > > > > to > >> > > > > > > > > understand. > >> > > > > > > > > > >> > > > > > > > > +1 for this suggestion, including jobvertex's name in > the response > >> > > > > > body > >> > > > > > > > is > >> > > > > > > > > more > >> > > > > > > > > user-friendly. > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > I saw this sentence in FLIP: "Setting the upper bound > to -1 will > >> > > > > > reset > >> > > > > > > > the > >> > > > > > > > > value to the default setting." What is the default > value here > >> > > > > (based > >> > > > > > > on > >> > > > > > > > > what configuration), or just infinite? > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > Best regards, > >> > > > > > > > > > >> > > > > > > > > Weijie > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > Shammon FY <zjur...@gmail.com> 于2023年2月6日周一 18:06写道: > >> > > > > > > > > > >> > > > > > > > > > Hi David > >> > > > > > > > > > > >> > > > > > > > > > Thanks for initiating this discussion. I think > declaring job > >> > > > > > resource > >> > > > > > > > > > requirements by REST API is very valuable. I just > left some > >> > > > > > comments > >> > > > > > > as > >> > > > > > > > > > followed > >> > > > > > > > > > > >> > > > > > > > > > 1) How about adding some more information such as > vertex type > >> > > > > > > > > > (SOURCE/MAP/JOIN and .etc) in the response of `get > jobs > >> > > > > > > > > > resource-requirements`? For users, only vertex-id may > be > >> > > > > difficult > >> > > > > > to > >> > > > > > > > > > understand. > >> > > > > > > > > > > >> > > > > > > > > > 2) For sql jobs, we always use a unified parallelism > for most > >> > > > > > > vertices. > >> > > > > > > > > Can > >> > > > > > > > > > we provide them with a more convenient setting method > instead of > >> > > > > > each > >> > > > > > > > > one? > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > Best, > >> > > > > > > > > > Shammon > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl < > >> > > > > > matthias.p...@aiven.io > >> > > > > > > > > > .invalid> > >> > > > > > > > > > wrote: > >> > > > > > > > > > > >> > > > > > > > > > > Thanks David for creating this FLIP. 
It sounds > promising and > >> > > > > > useful > >> > > > > > > > to > >> > > > > > > > > > > have. Here are some thoughts from my side (some of > them might > >> > > > > be > >> > > > > > > > > rather a > >> > > > > > > > > > > follow-up and not necessarily part of this FLIP): > >> > > > > > > > > > > - I'm wondering whether it makes sense to add some > kind of > >> > > > > > resource > >> > > > > > > > ID > >> > > > > > > > > to > >> > > > > > > > > > > the REST API. This would give Flink a tool to > verify the PATCH > >> > > > > > > > request > >> > > > > > > > > of > >> > > > > > > > > > > the external system in a compare-and-set kind of > manner. AFAIU, > >> > > > > > the > >> > > > > > > > > > process > >> > > > > > > > > > > requires the external system to retrieve the > resource > >> > > > > > requirements > >> > > > > > > > > first > >> > > > > > > > > > > (to retrieve the vertex IDs). A resource ID <ABC> > would be sent > >> > > > > > > along > >> > > > > > > > > as > >> > > > > > > > > > a > >> > > > > > > > > > > unique identifier for the provided setup. It's > essentially the > >> > > > > > > > version > >> > > > > > > > > ID > >> > > > > > > > > > > of the currently deployed resource requirement > configuration. > >> > > > > > Flink > >> > > > > > > > > > doesn't > >> > > > > > > > > > > know whether the external system would use the > provided > >> > > > > > information > >> > > > > > > > in > >> > > > > > > > > > some > >> > > > > > > > > > > way to derive a new set of resource requirements > for this job. > >> > > > > > The > >> > > > > > > > > > > subsequent PATCH request with updated resource > requirements > >> > > > > would > >> > > > > > > > > include > >> > > > > > > > > > > the previously retrieved resource ID <ABC>. The > PATCH call > >> > > > > would > >> > > > > > > fail > >> > > > > > > > > if > >> > > > > > > > > > > there was a concurrent PATCH call in between > indicating to the > >> > > > > > > > external > >> > > > > > > > > > > system that the resource requirements were > concurrently > >> > > > > updated. > >> > > > > > > > > > > - How often do we allow resource requirements to be > changed? > >> > > > > That > >> > > > > > > > > > question > >> > > > > > > > > > > might make my previous comment on the resource ID > obsolete > >> > > > > > because > >> > > > > > > we > >> > > > > > > > > > could > >> > > > > > > > > > > just make any PATCH call fail if there was a > resource > >> > > > > requirement > >> > > > > > > > > update > >> > > > > > > > > > > within a certain time frame before the request. But > such a time > >> > > > > > > > period > >> > > > > > > > > is > >> > > > > > > > > > > something we might want to make configurable then, > I guess. > >> > > > > > > > > > > - Versioning the JobGraph in the JobGraphStore > rather than > >> > > > > > > > overwriting > >> > > > > > > > > it > >> > > > > > > > > > > might be an idea. This would enable us to provide > resource > >> > > > > > > > requirement > >> > > > > > > > > > > changes in the UI or through the REST API. It is > related to a > >> > > > > > > problem > >> > > > > > > > > > > around keeping track of the exception history > within the > >> > > > > > > > > > AdaptiveScheduler > >> > > > > > > > > > > and also having to consider multiple versions of a > JobGraph. > >> > > > > But > >> > > > > > > for > >> > > > > > > > > that > >> > > > > > > > > > > one, we use the ExecutionGraphInfoStore right now. > >> > > > > > > > > > > - Updating the JobGraph in the JobGraphStore makes > sense. 
I'm > >> > > > > > just > >> > > > > > > > > > > wondering whether we bundle two things together > that are > >> > > > > actually > >> > > > > > > > > > separate: > >> > > > > > > > > > > The business logic and the execution configuration > (the > >> > > > > resource > >> > > > > > > > > > > requirements). I'm aware that this is not a flaw of > the current > >> > > > > > > FLIP > >> > > > > > > > > but > >> > > > > > > > > > > rather something that was not necessary to address > in the past > >> > > > > > > > because > >> > > > > > > > > > the > >> > > > > > > > > > > JobGraph was kind of static. I don't remember > whether that was > >> > > > > > > > already > >> > > > > > > > > > > discussed while working on the AdaptiveScheduler > for FLIP-160 > >> > > > > > [1]. > >> > > > > > > > > Maybe, > >> > > > > > > > > > > I'm missing some functionality here that requires > us to have > >> > > > > > > > everything > >> > > > > > > > > > in > >> > > > > > > > > > > one place. But it feels like updating the entire > JobGraph which > >> > > > > > > could > >> > > > > > > > > be > >> > > > > > > > > > > actually a "config change" is not reasonable. > ...also > >> > > > > considering > >> > > > > > > the > >> > > > > > > > > > > amount of data that can be stored in a > ConfigMap/ZooKeeper node > >> > > > > > if > >> > > > > > > > > > > versioning the resource requirement change as > proposed in my > >> > > > > > > previous > >> > > > > > > > > > item > >> > > > > > > > > > > is an option for us. > >> > > > > > > > > > > - Updating the JobGraphStore means adding more > requests to the > >> > > > > HA > >> > > > > > > > > backend > >> > > > > > > > > > > API. There were some concerns shared in the > discussion thread > >> > > > > [2] > >> > > > > > > for > >> > > > > > > > > > > FLIP-270 [3] on pressuring the k8s API server in > the past with > >> > > > > > too > >> > > > > > > > many > >> > > > > > > > > > > calls. Eventhough, it's more likely to be caused by > >> > > > > > checkpointing, > >> > > > > > > I > >> > > > > > > > > > still > >> > > > > > > > > > > wanted to bring it up. We're working on a > standardized > >> > > > > > performance > >> > > > > > > > test > >> > > > > > > > > > to > >> > > > > > > > > > > prepare going forward with FLIP-270 [3] right now. 
> >> > > > > > > > > > > > >> > > > > > > > > > > Best, > >> > > > > > > > > > > Matthias > >> > > > > > > > > > > > >> > > > > > > > > > > [1] > >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler > >> > > > > > > > > > > [2] > >> > > > > > > > https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj > >> > > > > > > > > > > [3] > >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints > >> > > > > > > > > > > > >> > > > > > > > > > > On Fri, Feb 3, 2023 at 10:31 AM ConradJam < > jam.gz...@gmail.com > >> > > > > > > >> > > > > > > > wrote: > >> > > > > > > > > > > > >> > > > > > > > > > > > Hi David: > >> > > > > > > > > > > > > >> > > > > > > > > > > > Thank you for drive this flip, which helps less > flink > >> > > > > shutdown > >> > > > > > > time > >> > > > > > > > > > > > > >> > > > > > > > > > > > for this flip, I would like to make a few idea on > share > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > - when the number of "slots" is insufficient, > can we can > >> > > > > > stop > >> > > > > > > > > users > >> > > > > > > > > > > > rescaling or throw something to tell user > "less avaliable > >> > > > > > > slots > >> > > > > > > > to > >> > > > > > > > > > > > upgrade, > >> > > > > > > > > > > > please checkout your alivalbe slots" ? Or we > could have a > >> > > > > > > > request > >> > > > > > > > > > > > switch(true/false) to allow this behavior > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > - when user upgrade job-vertx-parallelism . I > want to have > >> > > > > > an > >> > > > > > > > > > > interface > >> > > > > > > > > > > > to query the current update parallel execution > status, so > >> > > > > > that > >> > > > > > > > the > >> > > > > > > > > > > user > >> > > > > > > > > > > > or > >> > > > > > > > > > > > program can understand the current status > >> > > > > > > > > > > > - I want to have an interface to query the > current update > >> > > > > > > > > > parallelism > >> > > > > > > > > > > > execution status. 
This also helps similar to > *[1] Flink > >> > > > > K8S > >> > > > > > > > > > Operator* > >> > > > > > > > > > > > management > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > { > >> > > > > > > > > > > > status: Failed > >> > > > > > > > > > > > reason: "less avaliable slots to upgrade, > please checkout > >> > > > > > your > >> > > > > > > > > > alivalbe > >> > > > > > > > > > > > slots" > >> > > > > > > > > > > > } > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > - *Pending*: this job now is join the upgrade > queue,it > >> > > > > will > >> > > > > > be > >> > > > > > > > > > update > >> > > > > > > > > > > > later > >> > > > > > > > > > > > - *Rescaling*: job now is rescaling,wait it > finish > >> > > > > > > > > > > > - *Finished*: finish do it > >> > > > > > > > > > > > - *Failed* : something have wrong,so this job > is not > >> > > > > > alivable > >> > > > > > > > > > upgrade > >> > > > > > > > > > > > > >> > > > > > > > > > > > I want to supplement my above content in flip, > what do you > >> > > > > > think > >> > > > > > > ? > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > 1. > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/ > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > David Morávek <d...@apache.org> 于2023年2月3日周五 > 16:42写道: > >> > > > > > > > > > > > > >> > > > > > > > > > > > > Hi everyone, > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > This FLIP [1] introduces a new REST API for > declaring > >> > > > > > resource > >> > > > > > > > > > > > requirements > >> > > > > > > > > > > > > for the Adaptive Scheduler. There seems to be a > clear need > >> > > > > > for > >> > > > > > > > this > >> > > > > > > > > > API > >> > > > > > > > > > > > > based on the discussion on the "Reworking the > Rescale API" > >> > > > > > [2] > >> > > > > > > > > > thread. > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > Before we get started, this work is heavily > based on the > >> > > > > > > > prototype > >> > > > > > > > > > [3] > >> > > > > > > > > > > > > created by Till Rohrmann, and the FLIP is being > published > >> > > > > > with > >> > > > > > > > his > >> > > > > > > > > > > > consent. > >> > > > > > > > > > > > > Big shoutout to him! > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > Last and not least, thanks to Chesnay and Roman > for the > >> > > > > > initial > >> > > > > > > > > > reviews > >> > > > > > > > > > > > and > >> > > > > > > > > > > > > discussions. > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > The best start would be watching a short demo > [4] that I've > >> > > > > > > > > recorded, > >> > > > > > > > > > > > which > >> > > > > > > > > > > > > illustrates newly added capabilities (rescaling > the running > >> > > > > > > job, > >> > > > > > > > > > > handing > >> > > > > > > > > > > > > back resources to the RM, and session cluster > support). > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > The intuition behind the FLIP is being able to > define > >> > > > > > resource > >> > > > > > > > > > > > requirements > >> > > > > > > > > > > > > ("resource boundaries") externally that the > >> > > > > AdaptiveScheduler > >> > > > > > > can > >> > > > > > > > > > > > navigate > >> > > > > > > > > > > > > within. 
This is a building block for > higher-level efforts > >> > > > > > such > >> > > > > > > as > >> > > > > > > > > an > >> > > > > > > > > > > > > external Autoscaler. The natural extension of > this work > >> > > > > would > >> > > > > > > be > >> > > > > > > > to > >> > > > > > > > > > > allow > >> > > > > > > > > > > > > to specify per-vertex ResourceProfiles. > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > Looking forward to your thoughts; any feedback > is > >> > > > > > appreciated! > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > [1] > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management > >> > > > > > > > > > > > > [2] > >> > > > > > > > > > https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5 > >> > > > > > > > > > > > > [3] > https://github.com/tillrohrmann/flink/tree/autoscaling > >> > > > > > > > > > > > > [4] > >> > > > > > > > > > > > > >> > > > > > > > > > >> > > > > > > https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > Best, > >> > > > > > > > > > > > > D. > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > -- > >> > > > > > > > > > > > Best > >> > > > > > > > > > > > > >> > > > > > > > > > > > ConradJam > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > >