Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

David Morávek Mon, 20 Feb 2023 00:11:35 -0800

Hi everyone,

Thanks for the feedback! I've updated the FLIP to use idempotent PUT API
instead of PATCH and to properly handle lower bound settings, to support
the "pre-allocation" of the resources.


@Max

> How hard would it be to address this issue in the FLIP?

I've included this in the FLIP. It might not be too hard to implement this
in the end.

> B) drop as many superfluous task managers as needed

I've intentionally left this part out for now because this ultimately needs
to be the responsibility of the Resource Manager. After all, in the Session
Cluster scenario, the Scheduler doesn't have the bigger picture of other
tasks of other jobs running on those TMs. This will most likely be a topic
for another FLIP.

WDYT? If there are no other questions or concerns, I'd like to start the
vote on Wednesday.

Best,
D.

On Wed, Feb 15, 2023 at 3:34 PM Maximilian Michels <m...@apache.org> wrote:

> I missed that the FLIP states:
>
> > Currently, even though we’d expose the lower bound for clarity and API
> completeness, we won’t allow setting it to any other value than one until
> we have full support throughout the stack.
>
> How hard would it be to address this issue in the FLIP?
>
> There is not much value to offer setting a lower bound which won't be
> respected / throw an error when it is set. If we had support for a
> lower bound, we could enforce a resource contract externally via
> setting lowerBound == upperBound. That ties back to the Rescale API
> discussion we had. I want to better understand what the major concerns
> would be around allowing this.
>
> Just to outline how I imagine the logic to work:
>
> A) The resource constraints are already met => Nothing changes
> B) More resources available than required => Cancel the job, drop as
> many superfluous task managers as needed, restart the job
> C) Less resources available than required => Acquire new task
> managers, wait for them to register, cancel and restart the job
>
> I'm open to helping out with the implementation.
>
> -Max
>
> On Mon, Feb 13, 2023 at 7:45 PM Maximilian Michels <m...@apache.org> wrote:
> >
> > Based on further discussion I had with Chesnay on this PR [1], I think
> > jobs would currently go into a restarting state after the resource
> > requirements have changed. This wouldn't achieve what we had in mind,
> > i.e. sticking to the old resource requirements until enough slots are
> > available to fulfil the new resource requirements. So this may not be
> > 100% what we need but it could be extended to do what we want.
> >
> > -Max
> >
> > [1] https://github.com/apache/flink/pull/21908#discussion_r1104792362
> >
> > On Mon, Feb 13, 2023 at 7:16 PM Maximilian Michels <m...@apache.org>
> wrote:
> > >
> > > Hi David,
> > >
> > > This is awesome! Great writeup and demo. This is pretty much what we
> > > need for the autoscaler as part of the Flink Kubernetes operator [1].
> > > Scaling Flink jobs effectively is hard but fortunately we have solved
> > > the issue as part of the Flink Kubernetes operator. The only critical
> > > piece we are missing is a better way to execute scaling decisions, as
> > > discussed in [2].
> > >
> > > Looking at your proposal, we would set lowerBound == upperBound for
> > > the parallelism because we want to fully determine the parallelism
> > > externally based on the scaling metrics. Does that sound right?
> > >
> > > What is the timeline for these changes? Is there a JIRA?
> > >
> > > Cheers,
> > > Max
> > >
> > > [1]
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/
> > > [2] https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> > >
> > > On Mon, Feb 13, 2023 at 1:16 PM feng xiangyu <xiangyu...@gmail.com>
> wrote:
> > > >
> > > > Hi David,
> > > >
> > > > Thanks for your reply.  I think your response totally make sense.
> This
> > > > flip targets on declaring required resource to ResourceManager
> instead of
> > > > using  ResourceManager to add/remove TMs directly.
> > > >
> > > > Best,
> > > > Xiangyu
> > > >
> > > >
> > > >
> > > > David Morávek <david.mora...@gmail.com> 于2023年2月13日周一 15:46写道：
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > @Shammon
> > > > >
> > > > > I'm not entirely sure what "config file" you're referring to. You
> can, of
> > > > > course, override the default parallelism in "flink-conf.yaml", but
> for
> > > > > sinks and sources, the parallelism needs to be tweaked on the
> connector
> > > > > level ("WITH" statement).
> > > > >
> > > > > This is something that should be achieved with tooling around
> Flink. We
> > > > > want to provide an API on the lowest level that generalizes well.
> Achieving
> > > > > what you're describing should be straightforward with this API.
> > > > >
> > > > > @Xiangyu
> > > > >
> > > > > Is it possible for this REST API to declare TM resources in the
> future?
> > > > >
> > > > >
> > > > > Would you like to add/remove TMs if you use an active Resource
> Manager?
> > > > > This would be out of the scope of this effort since it targets the
> > > > > scheduler component only (we make no assumptions about the used
> Resource
> > > > > Manager). Also, the AdaptiveScheduler is only intended to be used
> for
> > > > > Streaming.
> > > > >
> > > > >  And for streaming jobs, I'm wondering if there is any situation
> we need to
> > > > > > rescale the TM resources of a flink cluster at first and then the
> > > > > adaptive
> > > > > > scheduler will rescale the per-vertex ResourceProfiles
> accordingly.
> > > > > >
> > > > >
> > > > > We plan on adding support for the ResourceProfiles (dynamic slot
> > > > > allocation) as the next step. Again we won't make any assumptions
> about the
> > > > > used Resource Manager. In other words, this effort ends by
> declaring
> > > > > desired resources to the Resource Manager.
> > > > >
> > > > > Does that make sense?
> > > > >
> > > > > @Matthias
> > > > >
> > > > > We've done another pass on the proposed API and currently lean
> towards
> > > > > having an idempotent PUT API.
> > > > > - We don't care too much about multiple writers' scenarios in
> terms of who
> > > > > can write an authoritative payload; this is up to the user of the
> API to
> > > > > figure out
> > > > > - It's indeed tricky to achieve atomicity with PATCH API;
> switching to PUT
> > > > > API seems to do the trick
> > > > > - We won't allow partial "payloads" anymore, meaning you need to
> define
> > > > > requirements for all vertices in the JobGraph; This is completely
> fine for
> > > > > the programmatic workflows. For DEBUG / DEMO purposes, you can use
> the GET
> > > > > endpoint and tweak the response to avoid writing the whole payload
> by hand.
> > > > >
> > > > > WDYT?
> > > > >
> > > > >
> > > > > Best,
> > > > > D.
> > > > >
> > > > > On Fri, Feb 10, 2023 at 11:21 AM feng xiangyu <
> xiangyu...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi David,
> > > > > >
> > > > > > Thanks for creating this flip. I think this work it is very
> useful,
> > > > > > especially in autoscaling scenario.  I would like to share some
> questions
> > > > > > from my view.
> > > > > >
> > > > > > 1, Is it possible for this REST API to declare TM resources in
> the
> > > > > future?
> > > > > > I'm asking because we are building the autoscaling feature for
> Flink OLAP
> > > > > > Session Cluster in ByteDance. We need to rescale the cluster's
> resource
> > > > > on
> > > > > > TM level instead of Job level. It would be very helpful if we
> have a REST
> > > > > > API for out external Autoscaling service to use.
> > > > > >
> > > > > > 2, And for streaming jobs, I'm wondering if there is any
> situation we
> > > > > need
> > > > > > to rescale the TM resources of a flink cluster at first and then
> the
> > > > > > adaptive scheduler will rescale the per-vertex ResourceProfiles
> > > > > > accordingly.
> > > > > >
> > > > > > best.
> > > > > > Xiangyu
> > > > > >
> > > > > > Shammon FY <zjur...@gmail.com> 于2023年2月9日周四 11:31写道：
> > > > > >
> > > > > > > Hi David
> > > > > > >
> > > > > > > Thanks for your answer.
> > > > > > >
> > > > > > > > Can you elaborate more about how you'd intend to use the
> endpoint? I
> > > > > > > think we can ultimately introduce a way of re-declaring
> "per-vertex
> > > > > > > defaults," but I'd like to understand the use case bit more
> first.
> > > > > > >
> > > > > > > For this issue, I mainly consider the consistency of user
> configuration
> > > > > > and
> > > > > > > job runtime. For sql jobs, users usually set specific
> parallelism for
> > > > > > > source and sink, and set a global parallelism for other
> operators.
> > > > > These
> > > > > > > config items are stored in a config file. For some
> high-priority jobs,
> > > > > > > users may want to manage them manually.
> > > > > > > 1. When users need to scale the parallelism, they should
> update the
> > > > > > config
> > > > > > > file and restart flink job, which may take a long time.
> > > > > > > 2. After providing the REST API, users can just send a request
> to the
> > > > > job
> > > > > > > via REST API quickly after updating the config file.
> > > > > > > The configuration in the running job and config file should be
> the
> > > > > same.
> > > > > > > What do you think of this?
> > > > > > >
> > > > > > > best.
> > > > > > > Shammon
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Feb 7, 2023 at 4:51 PM David Morávek <
> david.mora...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > Let's try to answer the questions one by one.
> > > > > > > >
> > > > > > > > *@ConradJam*
> > > > > > > >
> > > > > > > > when the number of "slots" is insufficient, can we can stop
> users
> > > > > > > rescaling
> > > > > > > > > or throw something to tell user "less avaliable slots to
> upgrade,
> > > > > > > please
> > > > > > > > > checkout your alivalbe slots" ?
> > > > > > > > >
> > > > > > > >
> > > > > > > > The main property of AdaptiveScheduler is that it can adapt
> to
> > > > > > "available
> > > > > > > > resources," which means you're still able to make progress
> even
> > > > > though
> > > > > > > you
> > > > > > > > didn't get all the slots you've asked for. Let's break down
> the pros
> > > > > > and
> > > > > > > > cons of this property.
> > > > > > > >
> > > > > > > > - (plus) If you lose a TM for some reason, you can still
> recover even
> > > > > > if
> > > > > > > it
> > > > > > > > doesn't come back. We still need to give it some time to
> eliminate
> > > > > > > > unnecessary rescaling, which can be controlled by setting
> > > > > > > > "resource-stabilization-timeout."
> > > > > > > > - (plus) The resources can arrive with a significant delay.
> For
> > > > > > example,
> > > > > > > > you're unable to spawn enough TMs on time because you've run
> out of
> > > > > > > > resources in your k8s cluster, and you need to wait for the
> cluster
> > > > > > auto
> > > > > > > > scaler to kick in and add new nodes to the cluster. In this
> scenario,
> > > > > > > > you'll be able to start making progress faster, at the cost
> of
> > > > > multiple
> > > > > > > > rescalings (once the remaining resources arrive).
> > > > > > > > - (plus) This plays well with the declarative manner of
> today's
> > > > > > > > infrastructure. For example, you tell k8s that you need 10
> TMs, and
> > > > > > > you'll
> > > > > > > > eventually get them.
> > > > > > > > - (minus) In the case of large state jobs, the cost of
> multiple
> > > > > > > rescalings
> > > > > > > > might outweigh the above.
> > > > > > > >
> > > > > > > > We've already touched on the solution to this problem on the
> FLIP.
> > > > > > Please
> > > > > > > > notice the parallelism knob being a range with a lower and
> upper
> > > > > bound.
> > > > > > > > Setting both the lower and upper bound to the same value
> could give
> > > > > the
> > > > > > > > behavior you're describing at the cost of giving up some
> properties
> > > > > > that
> > > > > > > AS
> > > > > > > > gives you (you'd be falling back to the DefaultScheduler's
> behavior).
> > > > > > > >
> > > > > > > > when user upgrade job-vertx-parallelism . I want to have an
> interface
> > > > > > to
> > > > > > > > > query the current update parallel execution status, so
> that the
> > > > > user
> > > > > > or
> > > > > > > > > program can understand the current status
> > > > > > > > >
> > > > > > > >
> > > > > > > > This is a misunderstanding. We're not introducing the RESCALE
> > > > > endpoint.
> > > > > > > > This endpoint allows you to re-declare the resources needed
> to run
> > > > > the
> > > > > > > job.
> > > > > > > > Once you reach the desired resources (you get more resources
> than the
> > > > > > > lower
> > > > > > > > bound defines), your job will run.
> > > > > > > >
> > > > > > > > We can expose a similar endpoint to "resource requirements"
> to give
> > > > > you
> > > > > > > an
> > > > > > > > overview of the resources the vertices already have. You can
> already
> > > > > > get
> > > > > > > > this from the REST API, so exposing this in yet another way
> should be
> > > > > > > > considered carefully.
> > > > > > > >
> > > > > > > > *@Matthias*
> > > > > > > >
> > > > > > > > I'm wondering whether it makes sense to add some kind of
> resource ID
> > > > > to
> > > > > > > the
> > > > > > > > > REST API.
> > > > > > > >
> > > > > > > >
> > > > > > > > That's a good question. I want to think about that and get
> back to
> > > > > the
> > > > > > > > question later. My main struggle when thinking about this
> is, "if
> > > > > this
> > > > > > > > would be an idempotent POST endpoint," would it be any
> different?
> > > > > > > >
> > > > > > > > How often do we allow resource requirements to be changed?
> > > > > > > >
> > > > > > > >
> > > > > > > > There shall be no rate limiting on the FLINK side. If this is
> > > > > something
> > > > > > > > your environment needs, you can achieve it on a different
> layer ("we
> > > > > > > can't
> > > > > > > > have FLINK to do everything").
> > > > > > > >
> > > > > > > > Versioning the JobGraph in the JobGraphStore rather than
> overwriting
> > > > > it
> > > > > > > > > might be an idea.
> > > > > > > > >
> > > > > > > >
> > > > > > > > This sounds interesting since it would be closer to the
> JobGraph
> > > > > being
> > > > > > > > immutable. The main problem I see here is that this would
> introduce a
> > > > > > > > BW-incompatible change so it might be a topic for follow-up
> FLIP.
> > > > > > > >
> > > > > > > > I'm just wondering whether we bundle two things together
> that are
> > > > > > > actually
> > > > > > > > > separate
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yup, this is how we think about it as well. The main
> question is,
> > > > > "who
> > > > > > > > should be responsible for bookkeeping 1) the JobGraph and 2)
> the
> > > > > > > > JobResourceRequirements". The JobMaster would be the right
> place for
> > > > > > > both,
> > > > > > > > but it's currently not the case, and we're tightly coupling
> the
> > > > > > > dispatcher
> > > > > > > > with the JobMaster.
> > > > > > > >
> > > > > > > > Initially, we tried to introduce a separate HA component in
> JobMaster
> > > > > > for
> > > > > > > > bookkeeping the JobResourceRequirements, but that proved to
> be a more
> > > > > > > > significant effort adding additional mess to the already
> messy HA
> > > > > > > > ecosystem. Another approach we've discussed was mutating the
> JobGraph
> > > > > > and
> > > > > > > > setting JRR into the JobGraph structure itself.
> > > > > > > >
> > > > > > > > The middle ground for keeping this effort reasonably sized
> and not
> > > > > > > > violating "we want to keep JG immutable" too much is keeping
> the
> > > > > > > > JobResourceRequirements separate as an internal config
> option in
> > > > > > > JobGraph's
> > > > > > > > configuration.
> > > > > > > >
> > > > > > > > We ultimately need to rethink the tight coupling of
> Dispatcher and
> > > > > > > > JobMaster, but it needs to be a separate effort.
> > > > > > > >
> > > > > > > > ...also considering the amount of data that can be stored in
> a
> > > > > > > > > ConfigMap/ZooKeeper node if versioning the resource
> requirement
> > > > > > change
> > > > > > > as
> > > > > > > > > proposed in my previous item is an option for us.
> > > > > > > > >
> > > > > > > >
> > > > > > > > AFAIK we're only storing pointers to the S3 objects in HA
> metadata,
> > > > > so
> > > > > > we
> > > > > > > > should be okay with having larger structures for now.
> > > > > > > >
> > > > > > > > Updating the JobGraphStore means adding more requests to the
> HA
> > > > > backend
> > > > > > > > API.
> > > > > > > > >
> > > > > > > >
> > > > > > > > It's fine unless you intend to override the resource
> requirements a
> > > > > few
> > > > > > > > times per second.
> > > > > > > >
> > > > > > > > *@Shammon*
> > > > > > > >
> > > > > > > > How about adding some more information such as vertex type
> > > > > > > > >
> > > > > > > >
> > > > > > > > Since it was intended as a "debug" endpoint, it makes
> complete sense!
> > > > > > > >
> > > > > > > >  For sql jobs, we always use a unified parallelism for most
> vertices.
> > > > > > Can
> > > > > > > > > we provide them with a more convenient setting method
> instead of
> > > > > each
> > > > > > > > one?
> > > > > > > >
> > > > > > > >
> > > > > > > > I completely feel with this. The main thoughts when
> designing the API
> > > > > > > were:
> > > > > > > > - We want to keep it clean and easy to understand.
> > > > > > > > - Global parallelism can be modeled using per-vertex
> parallelism but
> > > > > > not
> > > > > > > > the other way around.
> > > > > > > > - The API will be used by external tooling (operator, auto
> scaler).
> > > > > > > >
> > > > > > > > Can you elaborate more about how you'd intend to use the
> endpoint? I
> > > > > > > think
> > > > > > > > we can ultimately introduce a way of re-declaring "per-vertex
> > > > > > defaults,"
> > > > > > > > but I'd like to understand the use case bit more first.
> > > > > > > >
> > > > > > > > *@Weijie*
> > > > > > > >
> > > > > > > > What is the default value here (based on what
> configuration), or just
> > > > > > > > > infinite?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Currently, for the lower bound, it's always one, and for the
> upper
> > > > > > bound,
> > > > > > > > it's either parallelism (if defined) or the maxParallelism
> of the
> > > > > > vertex
> > > > > > > in
> > > > > > > > JobGraph. This question might be another signal for making
> the
> > > > > defaults
> > > > > > > > explicit (see the answer to Shammon's question above).
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks, everyone, for your initial thoughts!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > D.
> > > > > > > >
> > > > > > > > On Tue, Feb 7, 2023 at 4:39 AM weijie guo <
> guoweijieres...@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks David for driving this. This is a very valuable
> work,
> > > > > > especially
> > > > > > > > for
> > > > > > > > > cloud native environment.
> > > > > > > > >
> > > > > > > > > >> How about adding some more information such as vertex
> type
> > > > > > > > > (SOURCE/MAP/JOIN and .etc) in the response of `get jobs
> > > > > > > > > resource-requirements`? For users, only vertex-id may be
> difficult
> > > > > to
> > > > > > > > > understand.
> > > > > > > > >
> > > > > > > > > +1 for this suggestion, including jobvertex's name in the
> response
> > > > > > body
> > > > > > > > is
> > > > > > > > > more
> > > > > > > > > user-friendly.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I saw this sentence in FLIP: "Setting the upper bound to
> -1 will
> > > > > > reset
> > > > > > > > the
> > > > > > > > > value to the default setting."  What is the default value
> here
> > > > > (based
> > > > > > > on
> > > > > > > > > what configuration), or just infinite?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > >
> > > > > > > > > Weijie
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Shammon FY <zjur...@gmail.com> 于2023年2月6日周一 18:06写道：
> > > > > > > > >
> > > > > > > > > > Hi David
> > > > > > > > > >
> > > > > > > > > > Thanks for initiating this discussion. I think declaring
> job
> > > > > > resource
> > > > > > > > > > requirements by REST API is very valuable. I just left
> some
> > > > > > comments
> > > > > > > as
> > > > > > > > > > followed
> > > > > > > > > >
> > > > > > > > > > 1) How about adding some more information such as vertex
> type
> > > > > > > > > > (SOURCE/MAP/JOIN and .etc) in the response of `get jobs
> > > > > > > > > > resource-requirements`? For users, only vertex-id may be
> > > > > difficult
> > > > > > to
> > > > > > > > > > understand.
> > > > > > > > > >
> > > > > > > > > > 2) For sql jobs, we always use a unified parallelism for
> most
> > > > > > > vertices.
> > > > > > > > > Can
> > > > > > > > > > we provide them with a more convenient setting method
> instead of
> > > > > > each
> > > > > > > > > one?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Shammon
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl <
> > > > > > matthias.p...@aiven.io
> > > > > > > > > > .invalid>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks David for creating this FLIP. It sounds
> promising and
> > > > > > useful
> > > > > > > > to
> > > > > > > > > > > have. Here are some thoughts from my side (some of
> them might
> > > > > be
> > > > > > > > > rather a
> > > > > > > > > > > follow-up and not necessarily part of this FLIP):
> > > > > > > > > > > - I'm wondering whether it makes sense to add some
> kind of
> > > > > > resource
> > > > > > > > ID
> > > > > > > > > to
> > > > > > > > > > > the REST API. This would give Flink a tool to verify
> the PATCH
> > > > > > > > request
> > > > > > > > > of
> > > > > > > > > > > the external system in a compare-and-set kind of
> manner. AFAIU,
> > > > > > the
> > > > > > > > > > process
> > > > > > > > > > > requires the external system to retrieve the resource
> > > > > > requirements
> > > > > > > > > first
> > > > > > > > > > > (to retrieve the vertex IDs). A resource ID <ABC>
> would be sent
> > > > > > > along
> > > > > > > > > as
> > > > > > > > > > a
> > > > > > > > > > > unique identifier for the provided setup. It's
> essentially the
> > > > > > > > version
> > > > > > > > > ID
> > > > > > > > > > > of the currently deployed resource requirement
> configuration.
> > > > > > Flink
> > > > > > > > > > doesn't
> > > > > > > > > > > know whether the external system would use the provided
> > > > > > information
> > > > > > > > in
> > > > > > > > > > some
> > > > > > > > > > > way to derive a new set of resource requirements for
> this job.
> > > > > > The
> > > > > > > > > > > subsequent PATCH request with updated resource
> requirements
> > > > > would
> > > > > > > > > include
> > > > > > > > > > > the previously retrieved resource ID <ABC>. The PATCH
> call
> > > > > would
> > > > > > > fail
> > > > > > > > > if
> > > > > > > > > > > there was a concurrent PATCH call in between
> indicating to the
> > > > > > > > external
> > > > > > > > > > > system that the resource requirements were concurrently
> > > > > updated.
> > > > > > > > > > > - How often do we allow resource requirements to be
> changed?
> > > > > That
> > > > > > > > > > question
> > > > > > > > > > > might make my previous comment on the resource ID
> obsolete
> > > > > > because
> > > > > > > we
> > > > > > > > > > could
> > > > > > > > > > > just make any PATCH call fail if there was a resource
> > > > > requirement
> > > > > > > > > update
> > > > > > > > > > > within a certain time frame before the request. But
> such a time
> > > > > > > > period
> > > > > > > > > is
> > > > > > > > > > > something we might want to make configurable then, I
> guess.
> > > > > > > > > > > - Versioning the JobGraph in the JobGraphStore rather
> than
> > > > > > > > overwriting
> > > > > > > > > it
> > > > > > > > > > > might be an idea. This would enable us to provide
> resource
> > > > > > > > requirement
> > > > > > > > > > > changes in the UI or through the REST API. It is
> related to a
> > > > > > > problem
> > > > > > > > > > > around keeping track of the exception history within
> the
> > > > > > > > > > AdaptiveScheduler
> > > > > > > > > > > and also having to consider multiple versions of a
> JobGraph.
> > > > > But
> > > > > > > for
> > > > > > > > > that
> > > > > > > > > > > one, we use the ExecutionGraphInfoStore right now.
> > > > > > > > > > > - Updating the JobGraph in the JobGraphStore makes
> sense. I'm
> > > > > > just
> > > > > > > > > > > wondering whether we bundle two things together that
> are
> > > > > actually
> > > > > > > > > > separate:
> > > > > > > > > > > The business logic and the execution configuration (the
> > > > > resource
> > > > > > > > > > > requirements). I'm aware that this is not a flaw of
> the current
> > > > > > > FLIP
> > > > > > > > > but
> > > > > > > > > > > rather something that was not necessary to address in
> the past
> > > > > > > > because
> > > > > > > > > > the
> > > > > > > > > > > JobGraph was kind of static. I don't remember whether
> that was
> > > > > > > > already
> > > > > > > > > > > discussed while working on the AdaptiveScheduler for
> FLIP-160
> > > > > > [1].
> > > > > > > > > Maybe,
> > > > > > > > > > > I'm missing some functionality here that requires us
> to have
> > > > > > > > everything
> > > > > > > > > > in
> > > > > > > > > > > one place. But it feels like updating the entire
> JobGraph which
> > > > > > > could
> > > > > > > > > be
> > > > > > > > > > > actually a "config change" is not reasonable. ...also
> > > > > considering
> > > > > > > the
> > > > > > > > > > > amount of data that can be stored in a
> ConfigMap/ZooKeeper node
> > > > > > if
> > > > > > > > > > > versioning the resource requirement change as proposed
> in my
> > > > > > > previous
> > > > > > > > > > item
> > > > > > > > > > > is an option for us.
> > > > > > > > > > > - Updating the JobGraphStore means adding more
> requests to the
> > > > > HA
> > > > > > > > > backend
> > > > > > > > > > > API. There were some concerns shared in the discussion
> thread
> > > > > [2]
> > > > > > > for
> > > > > > > > > > > FLIP-270 [3] on pressuring the k8s API server in the
> past with
> > > > > > too
> > > > > > > > many
> > > > > > > > > > > calls. Eventhough, it's more likely to be caused by
> > > > > > checkpointing,
> > > > > > > I
> > > > > > > > > > still
> > > > > > > > > > > wanted to bring it up. We're working on a standardized
> > > > > > performance
> > > > > > > > test
> > > > > > > > > > to
> > > > > > > > > > > prepare going forward with FLIP-270 [3] right now.
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Matthias
> > > > > > > > > > >
> > > > > > > > > > > [1]
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
> > > > > > > > > > > [2]
> > > > > > >
> https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
> > > > > > > > > > > [3]
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Feb 3, 2023 at 10:31 AM ConradJam <
> jam.gz...@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi David:
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you for drive this flip, which helps less flink
> > > > > shutdown
> > > > > > > time
> > > > > > > > > > > >
> > > > > > > > > > > > for this flip, I would like to make a few idea on
> share
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >    - when the number of "slots" is insufficient, can
> we can
> > > > > > stop
> > > > > > > > > users
> > > > > > > > > > > >    rescaling or throw something to tell user "less
> avaliable
> > > > > > > slots
> > > > > > > > to
> > > > > > > > > > > > upgrade,
> > > > > > > > > > > >    please checkout your alivalbe slots" ? Or we
> could have a
> > > > > > > > request
> > > > > > > > > > > >    switch(true/false) to allow this behavior
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >    - when user upgrade job-vertx-parallelism . I
> want to have
> > > > > > an
> > > > > > > > > > > interface
> > > > > > > > > > > >    to query the current update parallel execution
> status, so
> > > > > > that
> > > > > > > > the
> > > > > > > > > > > user
> > > > > > > > > > > > or
> > > > > > > > > > > >    program can understand the current status
> > > > > > > > > > > >    - I want to have an interface to query the
> current update
> > > > > > > > > > parallelism
> > > > > > > > > > > >    execution status. This also helps similar to *[1]
> Flink
> > > > > K8S
> > > > > > > > > > Operator*
> > > > > > > > > > > >    management
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > {
> > > > > > > > > > > >   status: Failed
> > > > > > > > > > > >   reason: "less avaliable slots to upgrade, please
> checkout
> > > > > > your
> > > > > > > > > > alivalbe
> > > > > > > > > > > > slots"
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >    - *Pending*: this job now is join the upgrade
> queue,it
> > > > > will
> > > > > > be
> > > > > > > > > > update
> > > > > > > > > > > >    later
> > > > > > > > > > > >    - *Rescaling*: job now is rescaling,wait it finish
> > > > > > > > > > > >    - *Finished*: finish do it
> > > > > > > > > > > >    - *Failed* : something have wrong,so this job is
> not
> > > > > > alivable
> > > > > > > > > > upgrade
> > > > > > > > > > > >
> > > > > > > > > > > > I want to supplement my above content in flip, what
> do you
> > > > > > think
> > > > > > > ?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >    1.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > David Morávek <d...@apache.org> 于2023年2月3日周五
> 16:42写道：
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > > >
> > > > > > > > > > > > > This FLIP [1] introduces a new REST API for
> declaring
> > > > > > resource
> > > > > > > > > > > > requirements
> > > > > > > > > > > > > for the Adaptive Scheduler. There seems to be a
> clear need
> > > > > > for
> > > > > > > > this
> > > > > > > > > > API
> > > > > > > > > > > > > based on the discussion on the "Reworking the
> Rescale API"
> > > > > > [2]
> > > > > > > > > > thread.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Before we get started, this work is heavily based
> on the
> > > > > > > > prototype
> > > > > > > > > > [3]
> > > > > > > > > > > > > created by Till Rohrmann, and the FLIP is being
> published
> > > > > > with
> > > > > > > > his
> > > > > > > > > > > > consent.
> > > > > > > > > > > > > Big shoutout to him!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Last and not least, thanks to Chesnay and Roman
> for the
> > > > > > initial
> > > > > > > > > > reviews
> > > > > > > > > > > > and
> > > > > > > > > > > > > discussions.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The best start would be watching a short demo [4]
> that I've
> > > > > > > > > recorded,
> > > > > > > > > > > > which
> > > > > > > > > > > > > illustrates newly added capabilities (rescaling
> the running
> > > > > > > job,
> > > > > > > > > > > handing
> > > > > > > > > > > > > back resources to the RM, and session cluster
> support).
> > > > > > > > > > > > >
> > > > > > > > > > > > > The intuition behind the FLIP is being able to
> define
> > > > > > resource
> > > > > > > > > > > > requirements
> > > > > > > > > > > > > ("resource boundaries") externally that the
> > > > > AdaptiveScheduler
> > > > > > > can
> > > > > > > > > > > > navigate
> > > > > > > > > > > > > within. This is a building block for higher-level
> efforts
> > > > > > such
> > > > > > > as
> > > > > > > > > an
> > > > > > > > > > > > > external Autoscaler. The natural extension of this
> work
> > > > > would
> > > > > > > be
> > > > > > > > to
> > > > > > > > > > > allow
> > > > > > > > > > > > > to specify per-vertex ResourceProfiles.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Looking forward to your thoughts; any feedback is
> > > > > > appreciated!
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1]
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > > > > > > > > > > > > [2]
> > > > > > > > >
> https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> > > > > > > > > > > > > [3]
> https://github.com/tillrohrmann/flink/tree/autoscaling
> > > > > > > > > > > > > [4]
> > > > > > > > > > > >
> > > > > > > > >
> > > > > >
> https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > > D.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Best
> > > > > > > > > > > >
> > > > > > > > > > > > ConradJam
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
>

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Reply via email to