Thanks for the answer, David!
It sounds like there is a race condition, but it’s a known issue not specific
to this FLIP, and the failure case isn’t too bad. I’m satisfied with that.
Thanks,
John
On Thu, Feb 23, 2023, at 10:39, David Morávek wrote:
> Hi Everyone,
>
> @John
>
> This is a problem
I agree that it is useful to have a configurable lower bound. Thanks
for looking into it as part of a follow-up!
No objections from my side to move forward with the vote.
-Max
On Tue, Feb 28, 2023 at 1:36 PM David Morávek wrote:
>
> > I suppose we could further remove the min because it would a
> I suppose we could further remove the min because it would always be
safer to scale down if resources are not available than not to run at
all [1].
Apart from what @Roman has already mentioned, there are still cases where
we're certain that there is no point in running the jobs with resources
lo
Hi,
Thanks for the update; I think distinguishing the rescaling behaviour from
the desired parallelism declaration is important.
Having the ability to specify a minimum parallelism might be useful in
environments with multiple jobs: the scheduler will then have the option to stop
the less suitable job.
In ot
Hi David,
Thanks for the update! We are considering using the new declarative resource
API for autoscaling. Currently, we treat a scaling decision as a new
deployment, which means surrendering all resources to Kubernetes and
subsequently reallocating them for the rescaled deployment. The
declarative resou
Hi Everyone,
We had some more talks about the pre-allocation of resources with @Max, and
here is the final state that we've converged to for now:
The vital thing to note about the new API is that it's declarative, meaning
we're declaring the desired state to which we want our job to converge; If,
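For illustration, a minimal sketch of that declarative model: the client only states the desired per-vertex bounds, and the scheduler owns the process of converging towards them once it can. The class and vertex id below are placeholders I'm assuming, not the FLIP's actual API.

    import java.util.Map;

    public class DesiredStateSketch {
        // Hypothetical value object: lower/upper parallelism bound for one vertex.
        record VertexRequirement(int lowerBound, int upperBound) {}

        public static void main(String[] args) {
            // The client only declares the desired state ("some-vertex-id" is a placeholder).
            Map<String, VertexRequirement> desired =
                    Map.of("some-vertex-id", new VertexRequirement(1, 4));
            // In the real system the scheduler would pick this up and rescale the job
            // towards these bounds whenever enough slots become available.
            desired.forEach((vertexId, req) ->
                    System.out.printf("vertex %s -> [%d, %d]%n",
                            vertexId, req.lowerBound(), req.upperBound()));
        }
    }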
Hi Everyone,
@John
This is a problem that we've spent some time trying to crack; in the end,
we've decided to go against doing any upgrades to JobGraphStore from
JobMaster to avoid having multiple writers that are guarded by different
leader election locks (the Dispatcher and JobMaster might live in a
Thanks for the FLIP, David!
I just had one small question. IIUC, the REST API PUT request will go through
the new DispatcherGateway method to be handled. Then, after validation, the
dispatcher would call the new JobMasterGateway method to actually update the
job.
Which component will write th
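For illustration, a minimal sketch of the delegation being asked about, under the assumption that both gateways expose an update method; the types and method names here are placeholders, not necessarily the FLIP's final signatures.

    import java.util.Map;
    import java.util.concurrent.CompletableFuture;

    // Hypothetical stand-in for the job resource requirements payload (JRR).
    record JobResourceRequirements(Map<String, Integer> perVertexParallelism) {}

    // Placeholder gateway shapes, to show the call chain only.
    interface JobMasterGateway {
        CompletableFuture<Void> updateJobResourceRequirements(JobResourceRequirements jrr);
    }

    interface DispatcherGateway {
        CompletableFuture<Void> updateJobResourceRequirements(String jobId, JobResourceRequirements jrr);
    }

    // REST PUT handler -> DispatcherGateway -> JobMasterGateway of the running job.
    class SketchDispatcher implements DispatcherGateway {
        private final JobMasterGateway jobMaster; // in reality resolved per job id

        SketchDispatcher(JobMasterGateway jobMaster) {
            this.jobMaster = jobMaster;
        }

        @Override
        public CompletableFuture<Void> updateJobResourceRequirements(
                String jobId, JobResourceRequirements jrr) {
            // request validation would happen here before delegating
            return jobMaster.updateJobResourceRequirements(jrr);
        }
    }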
Thanks for your clarifications, David. I don't have any additional major
points to add. One thing about the FLIP: The RPC layer API for updating the
JRR returns a future with a JRR? I don't see value in returning a JRR here
since it's an idempotent operation? Wouldn't it be enough to return
Complet
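To make the two options concrete, they could look roughly like this; "Acknowledge" just stands in for whatever empty acknowledgement type would be used, and none of these names are taken from the FLIP.

    import java.util.concurrent.CompletableFuture;

    // Placeholder types for the sake of the comparison.
    record JobResourceRequirements() {}
    record Acknowledge() {}

    interface VariantEchoingJrr {
        // Returns the (possibly normalized) requirements back to the caller.
        CompletableFuture<JobResourceRequirements> updateJobResourceRequirements(
                JobResourceRequirements jrr);
    }

    interface VariantAckOnly {
        // Idempotent update: a plain acknowledgement is enough; callers that need
        // the current requirements can issue a separate GET.
        CompletableFuture<Acknowledge> updateJobResourceRequirements(
                JobResourceRequirements jrr);
    }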
Thanks David! If we could get the pre-allocation working as part of
the FLIP, that would be great.
Concerning the downscale case, I agree this is a special case for the
(single-job) application mode where we could re-allocate slots in a
way that could leave entire task managers unoccupied which we
Hi everyone,
Thanks for the feedback! I've updated the FLIP to use an idempotent PUT API
instead of PATCH and to properly handle lower-bound settings, in order to support
the "pre-allocation" of resources.
@Max
> How hard would it be to address this issue in the FLIP?
I've included this in the FLIP. It
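As a client-side illustration of such an idempotent PUT, something along these lines could work; the URL path, JSON field names, and job id below are assumptions based on this thread, not the FLIP's final wording.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class UpdateRequirementsSketch {
        public static void main(String[] args) throws Exception {
            // Assumed body shape: per-vertex lower/upper parallelism bounds.
            String body =
                    "{\"some-vertex-id\": {\"parallelism\": {\"lowerBound\": 1, \"upperBound\": 4}}}";

            HttpRequest request = HttpRequest.newBuilder()
                    // Assumed endpoint path; replace host, port and job id as needed.
                    .uri(URI.create("http://localhost:8081/jobs/replace-with-job-id/resource-requirements"))
                    .header("Content-Type", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response =
                    HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }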
I missed that the FLIP states:
> Currently, even though we’d expose the lower bound for clarity and API
> completeness, we won’t allow setting it to any value other than one until we
> have full support throughout the stack.
How hard would it be to address this issue in the FLIP?
There is not
Based on further discussion I had with Chesnay on this PR [1], I think
jobs would currently go into a restarting state after the resource
requirements have changed. This wouldn't achieve what we had in mind,
i.e. sticking to the old resource requirements until enough slots are
available to fulfil t
Hi David,
This is awesome! Great writeup and demo. This is pretty much what we
need for the autoscaler as part of the Flink Kubernetes operator [1].
Scaling Flink jobs effectively is hard but fortunately we have solved
the issue as part of the Flink Kubernetes operator. The only critical
piece we
Hi David,
Thanks for your reply. Your response totally makes sense. This FLIP targets
declaring the required resources to the ResourceManager instead of using the
ResourceManager to add/remove TMs directly.
Best,
Xiangyu
On Mon, Feb 13, 2023 at 15:46, David Morávek wrote:
> Hi everyone,
>
> @Shammon
>
> I'
Hi everyone,
@Shammon
I'm not entirely sure what "config file" you're referring to. You can, of
course, override the default parallelism in "flink-conf.yaml", but for
sinks and sources, the parallelism needs to be tweaked on the connector
level ("WITH" statement).
This is something that should b
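For example, a connector-level override in SQL could look roughly like the snippet below; whether a given connector actually exposes an option such as 'sink.parallelism' depends on the connector, so treat the table definition as purely illustrative.

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class ConnectorParallelismSketch {
        public static void main(String[] args) {
            TableEnvironment tEnv =
                    TableEnvironment.create(EnvironmentSettings.inStreamingMode());
            // Parallelism tweaked on the connector level via the WITH clause.
            tEnv.executeSql(
                    "CREATE TABLE sink_table (id BIGINT, payload STRING) WITH ("
                            + " 'connector' = 'filesystem',"
                            + " 'path' = 'file:///tmp/sink-out',"
                            + " 'format' = 'json',"
                            + " 'sink.parallelism' = '4'"
                            + ")");
        }
    }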
Hi David,
Thanks for creating this FLIP. I think this work is very useful,
especially in autoscaling scenarios. I would like to share some questions
from my side.
1. Is it possible for this REST API to declare TM resources in the future?
I'm asking because we are building the autoscaling featu
Hi David
Thanks for your answer.
> Can you elaborate more about how you'd intend to use the endpoint? I
think we can ultimately introduce a way of re-declaring "per-vertex
defaults," but I'd like to understand the use case a bit more first.
For this issue, I mainly consider the consistency of user
Hi everyone,
Let's try to answer the questions one by one.
*@ConradJam*
when the number of "slots" is insufficient, can we stop users from rescaling
> or throw something to tell the user "fewer available slots to upgrade, please
> check your available slots"?
>
The main property of AdaptiveSchedul
Thanks David for driving this. This is very valuable work, especially for
cloud-native environments.
>> How about adding some more information, such as the vertex type
(SOURCE/MAP/JOIN, etc.), in the response of `get jobs
resource-requirements`? For users, only vertex-id may be difficult to
understa
Hi David
Thanks for initiating this discussion. I think declaring job resource
requirements via a REST API is very valuable. I just left some comments as
follows:
1) How about adding some more information, such as the vertex type
(SOURCE/MAP/JOIN, etc.), in the response of `get jobs
resource-requiremen
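To illustrate the suggestion, one entry of the `get jobs resource-requirements` response extended with a vertex type might look something like this; the field and enum names are guesses, not part of the FLIP.

    // Hypothetical shape of a single response entry, with the proposed addition.
    enum VertexType { SOURCE, MAP, JOIN, SINK }

    record VertexRequirementsEntry(
            String vertexId,       // opaque vertex id, as today
            VertexType vertexType, // the extra information being suggested here
            int lowerBound,
            int upperBound) {}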
Thanks David for creating this FLIP. It sounds promising and useful to
have. Here are some thoughts from my side (some of them might rather be a
follow-up and not necessarily part of this FLIP):
- I'm wondering whether it makes sense to add some kind of resource ID to
the REST API. This would give
Hi David:
Thank you for driving this FLIP, which helps reduce Flink shutdown time.
For this FLIP, I would like to share a few ideas:
- when the number of "slots" is insufficient, can we stop users from
rescaling or throw something to tell the user "fewer available slots to upgrade,
please c
Hi everyone,
This FLIP [1] introduces a new REST API for declaring resource requirements
for the Adaptive Scheduler. There seems to be a clear need for this API
based on the discussion on the "Reworking the Rescale API" [2] thread.
Before we get started, this work is heavily based on the prototyp