Thanks to Gyula and Max for a great start. I'll try this feature out and
raise an issue if I find anything 🙂
On Thu, Dec 15, 2022 at 02:37, Maximilian Michels wrote:
> A heads-up: Gyula just opened a PR with the code contribution based on the
> design: https://github.com/apache/flink-kubernetes-operator/pull/484
>
> We
A heads-up: Gyula just opened a PR with the code contribution based on the
design: https://github.com/apache/flink-kubernetes-operator/pull/484
We have run some tests based on the current state and achieved very good
results thus far. We were able to cut the resources of some of the
deployments by ...
Thanks for the reply, Gyula and Max.
Prasanna
On Sat, 26 Nov 2022, 00:24 Maximilian Michels wrote:
> Hi John, hi Prasanna, hi Rui,
>
> Gyula already gave great answers to your questions, just adding to it:
>
> >What's the reason to add auto scaling to the Operator instead of to the
> JobMana...
Hi John, hi Prasanna, hi Rui,
Gyula already gave great answers to your questions, just adding to it:
>What's the reason to add auto scaling to the Operator instead of to the
JobManager?
As Gyula mentioned, the JobManager is not the ideal place, at least not
until Flink supports in-place autoscaling ...
Hi Gyula
Thanks for the clarification!
Best
Rui Fan
On Fri, Nov 25, 2022 at 1:50 PM Gyula Fóra wrote:
> Rui, Prasanna:
>
> I am afraid that creating a completely independent autoscaler process that
> works with any type of Flink clusters is out of scope right now due to the
> following reasons
Rui, Prasanna:
I am afraid that creating a completely independent autoscaler process that
works with any type of Flink clusters is out of scope right now due to the
following reasons:
If we were to create a new general process, we would have to implement high
availability and a pluggable mechanism ...
Thanks for this answer, Gyula!
-John
On Thu, Nov 24, 2022, at 14:53, Gyula Fóra wrote:
> Hi John!
>
> Thank you for the excellent question.
>
> There are a few reasons why we felt that the operator is the right place for
> this component:
>
> - Ideally the autoscaler is a separate process (an outside ...
Hi Max,
This is a great initiative and good discussion going on.
We have set up our Flink cluster using Amazon ECS, so it would be good to
design this in such a way that we can deploy the autoscaler as a separate
Docker image which could observe the JM and jobs, and emit outputs that can
be used to trigger the E...
Hi Gyula, Max, John!
Thanks for the great FLIP, it's very useful for Flink users.
> Ideally the autoscaler is a separate process (an outside observer)
Could we eventually use the autoscaler as an outside tool, or run it as a
separate Java process? If it's complex, can the part that detects
the job ...
Hi John!
Thank you for the excellent question.
There are a few reasons why we felt that the operator is the right place for
this component:
- Ideally the autoscaler is a separate process (an outside observer), and
the JobManager is very much tied to the lifecycle of the job. The operator
is a pe...
Hi Max,
Thanks for the FLIP!
I've been curious about one point. I can imagine some good reasons for it
but wonder what you have in mind. What's the reason to add auto scaling to
the Operator instead of to the JobManager?
It seems like adding that capability to the JobManager would be a big ...
Thanks for your comments @Dong and @Chen. It is true that not all the
details are contained in the FLIP. The document is meant as a general
design concept.
As for the rescaling time, this is going to be a configurable setting for
now, but it is foreseeable that we will provide auto-tuning of this c...
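[Editor's note] To make the "configurable setting" concrete, here is a rough sketch of what a restart-time option could look like using Flink's ConfigOption API. The key name and default value are my own illustration, not the FLIP's final configuration surface.

```java
import java.time.Duration;

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;
import org.apache.flink.configuration.Configuration;

/** Hypothetical autoscaler option; the key and default are illustrative only. */
public class AutoScalerOptionsSketch {

    // Assumed time a rescale (job restart) takes; the algorithm can use it to
    // estimate how much extra backlog accumulates during the operation.
    public static final ConfigOption<Duration> RESTART_TIME =
            ConfigOptions.key("kubernetes.operator.job.autoscaler.restart.time")
                    .durationType()
                    .defaultValue(Duration.ofMinutes(5))
                    .withDescription("Estimated duration of a rescale operation.");

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set(RESTART_TIME, Duration.ofMinutes(3));
        System.out.println("Assumed restart time: " + conf.get(RESTART_TIME));
    }
}
```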
>Do we think the scaler could be a plugin or hard-coded?
+1 For pluggable scaling logic.
On Mon, Nov 21, 2022 at 3:38 AM Chen Qin wrote:
> On Sun, Nov 20, 2022 at 7:25 AM Gyula Fóra wrote:
>
> > Hi Chen!
> >
> > I think in the long term it makes sense to provide some pluggable
> > mechanisms
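[Editor's note] To give the pluggable-scaling-logic idea above some shape, here is a minimal sketch of what a custom decision hook might look like. The interface name and signature are hypothetical; neither the FLIP nor the operator defines such an API at this point.

```java
import java.util.Map;

/**
 * Hypothetical plugin point for custom scaling logic. It only illustrates
 * where user code could veto or adjust decisions, e.g. to ignore short
 * spikes or a degraded async I/O dependency as mentioned in the thread.
 */
public interface ScalingDecisionPlugin {

    /** Adjust the per-vertex parallelism proposed by the built-in algorithm. */
    Map<String, Integer> adjust(
            Map<String, Integer> proposedParallelism,
            Map<String, Double> vertexBusyRatios);

    /** Example veto: skip scaling entirely, e.g. during a maintenance window. */
    default boolean shouldScale() {
        return true;
    }
}
```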
On Sun, Nov 20, 2022 at 7:25 AM Gyula Fóra wrote:
> Hi Chen!
>
> I think in the long term it makes sense to provide some pluggable
> mechanisms but it's not completely trivial where exactly you would plug in
> your custom logic at this point.
>
Sounds good. More specifically, it would be great if it ...
Hi Chen!
I think in the long term it makes sense to provide some pluggable
mechanisms but it's not completely trivial where exactly you would plug in
your custom logic at this point.
In any case the problems you mentioned should be solved robustly by the
algorithm itself without any customization
Hi Gyula,
Do we think the scaler could be a plugin or hard-coded?
We observed some cases the scaler can't address (e.g. async I/O dependency
service degradation, or a small spike that isn't worth restarting the job
for).
Thanks,
Chen
On Fri, Nov 18, 2022 at 1:03 AM Gyula Fóra wrote:
> Hi Dong!
>
> Could you
Hi Gyula!
Thanks for all the explanations!
Personally, I would like to see a full story of how the algorithm works
(e.g. how it determines the estimated time for scale), how users can get
the basic information needed to monitor the health/effectiveness of the
autoscaler (e.g. metrics), and how the al...
Hi Dong!
Could you please confirm that your main concerns have been addressed?
Some other minor details that might not have been fully clarified:
- The prototype has been validated on some production workloads, yes
- We are only planning to use metrics that are generally available and are
previo...
Hi Dong!
This is not an experimental feature proposal. The implementation of the
prototype is still in an experimental phase, but by the time the FLIP,
initial prototype, and review are done, this should be in a good, stable
first version.
This proposal is pretty general as autoscalers/tuners get as f...
Hi Gyula,
If I understand correctly, this autopilot proposal is an experimental
feature and its configs/metrics are not mature enough to provide backward
compatibility yet. And the proposal provides high-level ideas of the
algorithm but it is probably too complicated to explain it end-to-end.
On
Hi Dong,
Let me address your comments.
Time for scale / backlog processing time derivation:
We can add some more details to the FLIP, but at this point the
implementation is actually much simpler than the algorithm that would
describe it. I would not like to add more equations etc. because it just
overcomp...
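[Editor's note] For readers following the "time for scale / backlog processing time" point, here is a back-of-the-envelope sketch of the catch-up math as I read it from this thread. All numbers and the exact formula are illustrative, not taken from the prototype.

```java
/** Illustrative catch-up arithmetic; values and formula are my own sketch. */
public class CatchUpSketch {

    public static void main(String[] args) {
        double inputRatePerSec = 1000.0;   // average records/s arriving at the source
        double backlogRecords = 600_000.0; // pending records (e.g. Kafka lag)
        double restartSec = 120.0;         // estimated downtime of a rescale
        double catchUpWindowSec = 600.0;   // time in which the lag should be gone

        // While the job restarts, the backlog keeps growing.
        double backlogAfterRestart = backlogRecords + inputRatePerSec * restartSec;

        // Target rate: keep up with new input AND drain the backlog in time.
        double targetRatePerSec = inputRatePerSec + backlogAfterRestart / catchUpWindowSec;

        // 1000 + (600000 + 120000) / 600 = 2200 records/s
        System.out.printf("target processing rate: %.0f records/s%n", targetRatePerSec);
    }
}
```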
Thanks for the update! Please see comments inline.
On Tue, Nov 15, 2022 at 11:46 PM Maximilian Michels wrote:
> Of course! Let me know if your concerns are addressed. The wiki page has
> been updated.
>
> >It will be great to add this in the FLIP so that reviewers can understand
> how the source
Of course! Let me know if your concerns are addressed. The wiki page has
been updated.
>It will be great to add this in the FLIP so that reviewers can understand
how the source parallelisms are computed and how the algorithm works
end-to-end.
I've updated the FLIP page to add more details on how ...
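[Editor's note] As a reader's aid, one plausible shape of the source-parallelism computation discussed here: a target rate is established at the sources and then propagated downstream using the observed records-out/records-in ratio of each vertex. The topology and ratios below are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative rate propagation along a linear chain of vertices. */
public class RatePropagationSketch {

    public static void main(String[] args) {
        double sourceTargetRate = 2200.0; // e.g. from the backlog/catch-up math

        // vertex -> records emitted per record received (observed ratio)
        Map<String, Double> outPerIn = new LinkedHashMap<>();
        outPerIn.put("flatMap", 3.0); // fans out
        outPerIn.put("filter", 0.5);  // drops half
        outPerIn.put("sink", 0.0);    // terminal

        double rate = sourceTargetRate;
        System.out.println("source target input rate: " + rate);
        for (Map.Entry<String, Double> vertex : outPerIn.entrySet()) {
            System.out.println(vertex.getKey() + " target input rate: " + rate);
            rate = rate * vertex.getValue(); // what this vertex emits downstream
        }
    }
}
```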
Hi Maximilian,
It seems that the following comments from the previous discussions have not
been addressed yet. Any chance we can have them addressed before starting
the voting thread?
Thanks,
Dong
On Mon, Nov 7, 2022 at 2:33 AM Gyula Fóra wrote:
> Hi Dong!
>
> Let me try to answer the question
I agree we should start the vote.
On a separate (but related) note, we could also decide on backporting
https://issues.apache.org/jira/browse/FLINK-29501 for 1.16.1 so that the
autoscaler can be developed and tested more efficiently and made 1.16
compatible.
Cheers,
Gyula
On Tue,
+1 If there are no further comments, I'll start a vote thread in the next
few days.
-Max
On Tue, Nov 15, 2022 at 2:06 PM Zheng Yu Chen wrote:
> @Gyula Good news: FLIP-256 is now finished and merged.
> The FLIP-271 discussion seems to have stopped and I wonder if there are
> any othe...
@Gyula Good news: FLIP-256 is now finished and merged.
The FLIP-271 discussion seems to have stopped and I wonder if there are any
other comments. Can we move to a vote and start this exciting feature? 🙂
Maybe I can get involved in developing this feature.
On Tue, Nov 8, 2022 at 18..., Gyula Fóra wrote:
>> # Horizontal scaling vs. vertical scaling
>
>True. We left out vertical scaling intentionally. For now we assume CPU /
memory is set up by the user. While definitely useful, vertical scaling
adds another dimension to the scaling problem which we wanted to tackle
later. I'll update the FLIP to ...
I had two extra comments to add to Max's reply:
1. About pre-allocating resources:
This could be done relatively easily through the operator when the
standalone deployment mode is used, as there we have better control over
pods/resources.
2. Session jobs:
There is a FLIP (
https://cwiki.apache.org/confluence
@Yang
>Since the current auto-scaling needs to fully redeploy the application, it
may fail to start due to lack of resources.
Great suggestions. I agree that we will have to preallocate / reserve
resources to ensure the rescaling doesn't take longer than expected.
This is not only a problem ...
Thanks for the fruitful discussion. I am really excited to see that
auto-scaling is really happening for the Flink Kubernetes operator. It will
be a very important step to make long-running Flink jobs run more smoothly.
I just have some early ideas and want to share them here.
# Resource Reserv...
Thanks for all the interest here and for the great remarks! Gyula
already did a great job addressing the questions here. Let me try to
add additional context:
@Biao Geng:
>1. For source parallelisms, if the user configures a much larger value than
>normal, there should be very little pending rec...
@Dong:
Looking at the busyTime metrics in the TaskIOMetricGroup, it seems that busy
time is actually defined as "not idle or (soft) backpressured". So I think
it would give us the correct reading based on what you said about the Kafka
sink.
In any case we have to test this, and if something is not ...
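[Editor's note] A tiny sketch of why this busy-time definition matters, assuming busyTimeMsPerSecond means "time neither idle nor backpressured": the observed rate can be extrapolated to the rate the operator could sustain at 100% busy. The numbers are illustrative.

```java
/** Illustrative "true processing rate" extrapolation from busy time. */
public class TrueRateSketch {

    public static void main(String[] args) {
        double observedRatePerSec = 800.0;  // records actually processed per second
        double busyTimeMsPerSecond = 400.0; // busy 400 ms out of every second

        double busyRatio = busyTimeMsPerSecond / 1000.0;
        // Busy only 40% of the time -> full-throttle capacity is ~2.5x higher.
        double capacityPerSec = observedRatePerSec / busyRatio; // -> 2000 rec/s

        System.out.printf("capacity at 100%% busy: %.0f records/s%n", capacityPerSec);
    }
}
```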
Thanks for the explanation Gyula. Please see my reply inline.
BTW, has the proposed solution been deployed and evaluated with any
production workload? If yes, I am wondering if you could share the
experience, e.g. what is the likelihood of having a regression or an
improvement, respectively, after enabl...
@Gyula,
Thanks for the explanation and the follow-up actions. That sounds good to
me.
Thanks,
JunRui Lee
On Mon, Nov 7, 2022 at 12:20, Yanfei Lei wrote:
> Hi Max,
>
> Thanks for the proposal. This proposal makes Flink better adapted to
> cloud-native applications!
>
> After reading the FLIP, I'm curious abo...
Hi Max,
Thanks for the proposal. This proposal makes Flink better adapted to
cloud-native applications!
After reading the FLIP, I'm curious about some points:
1) It's said that "The first step is collecting metrics for all JobVertices
by combining metrics from all the runtime subtasks and comput...
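[Editor's note] To illustrate the quoted "combining metrics from all the runtime subtasks" step, here is one way the per-vertex aggregation could work. The aggregation choices (summing rates, taking the max busy time) are my assumption for the sketch, not necessarily what the FLIP specifies.

```java
import java.util.List;

/** Illustrative aggregation of subtask metrics into one JobVertex view. */
public class VertexMetricsSketch {

    record SubtaskMetrics(double numRecordsInPerSec, double busyTimeMsPerSec) {}

    static double vertexInputRate(List<SubtaskMetrics> subtasks) {
        // Throughput rates add up across parallel subtasks.
        return subtasks.stream().mapToDouble(SubtaskMetrics::numRecordsInPerSec).sum();
    }

    static double vertexBusyTime(List<SubtaskMetrics> subtasks) {
        // The slowest subtask gates the vertex, so the max is one defensible choice.
        return subtasks.stream().mapToDouble(SubtaskMetrics::busyTimeMsPerSec).max().orElse(0);
    }

    public static void main(String[] args) {
        List<SubtaskMetrics> subtasks = List.of(
                new SubtaskMetrics(400, 350), new SubtaskMetrics(420, 610));
        System.out.println(vertexInputRate(subtasks) + " rec/s, busy "
                + vertexBusyTime(subtasks) + " ms/s");
    }
}
```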
Hi Dong!
Let me try to answer the questions :)
1: busyTimeMsPerSecond is not specific to CPU; it measures the time spent
in the main record processing loop for an operator, if I understand
correctly. This includes IO operations too.
2: We should add this to the FLIP, I agree. It would be a Durat...
Hi Max,
Thank you for the proposal. The proposal tackles a very important issue for
Flink users and the design looks promising overall!
I have some questions to better understand the proposed public interfaces
and the algorithm.
1) The proposal seems to assume that the operator's busyTimeMsPerSecond ...
@Pedro:
The current design focuses on record processing time metrics. In most cases
when we need to scale (such as too much state per operator), record
processing time actually slows, so it would detect that. Of course in the
future we can add new logic if we see something missing.
@ConradJam:
We
Hi Max,
Thank you for driving this FLIP! I have some advice for it:
Could we have not only an (on/off) switch, but also one more option,
(advise)?
After the user enables (advise), it does not actually perform autoscaling;
it only outputs tuning suggestions in the form of notifications for t...
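[Editor's note] The suggestion above amounts to a three-way switch rather than a plain on/off flag. A minimal sketch of that idea; the enum and its semantics are hypothetical:

```java
/** Hypothetical three-way autoscaler mode implementing the "advise" idea. */
public enum AutoscalerMode {
    DISABLED, // feature off, no evaluation at all
    ADVISE,   // evaluate and report recommended parallelism, never act
    ENABLED;  // evaluate and actually rescale the job

    public boolean shouldApplyScaling() {
        return this == ENABLED;
    }

    public boolean shouldEmitRecommendations() {
        return this != DISABLED;
    }
}
```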
@Gyula,
Thank you for your reply, the answer makes perfect sense. I have a
follow-up if that's ok.
IIUC this FLIP uses metrics that relate to backpressure at an operator level
(records in vs. out, busy time, etc.).
Could the FLIP also be used to auto-scale based on state-level metrics at an
ope...
@JunRui:
There are two pieces that prevent scaling on minor load variations. Firstly,
the algorithm / logic is intended to work on metrics averaged over a
configured time window (let's say the last 5 minutes); this smooths out
minor variances and results in more stability. Secondly, in addition to the
utilizat...
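[Editor's note] Putting the two guards together, here is a small sketch: decisions are based on a windowed average, and only fire when that average leaves a utilization band. Window size and thresholds are illustrative.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative flapping guard: windowed average plus a utilization band. */
public class StabilitySketch {

    private final Deque<Double> window = new ArrayDeque<>();
    private final int windowSize;

    StabilitySketch(int windowSize) {
        this.windowSize = windowSize;
    }

    /** Record one utilization sample (0..1); return true if scaling is needed. */
    boolean offer(double utilization, double low, double high) {
        window.addLast(utilization);
        if (window.size() > windowSize) {
            window.removeFirst();
        }
        double avg = window.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        // Act only when the *windowed average* leaves [low, high], so a
        // single-sample spike never triggers a restart.
        return window.size() == windowSize && (avg < low || avg > high);
    }

    public static void main(String[] args) {
        StabilitySketch guard = new StabilitySketch(5);
        double[] samples = {0.6, 0.95, 0.6, 0.6, 0.6, 0.6}; // one spike
        for (double u : samples) {
            System.out.println(guard.offer(u, 0.3, 0.8)); // always false here
        }
    }
}
```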
Hello,
First of all, thank you for tackling this theme; it is a massive boon to
Flink if it gets in.
Following up on JunRui Lee's question:
Have you considered triggering metrics collection based on events rather
than periodic checks?
I.e. if input source lag is increasing for the p...
Hi Max,
Thanks for writing this FLIP and initiating the discussion.
I just have a small question after reading the FLIP:
In the document, I didn't find the definition of when to trigger
autoscaling after some JobVertex reaches the threshold. If I missed it,
please let me know.
IIUC, the proper tri...
Hey!
Thanks for the input!
The algorithm does not really differentiate between scaling up or down, as
it's concerned with finding the right parallelism to match the target
processing rate with just enough spare capacity.
Let me try to address your specific points:
1. The backlog growth rate onl...
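[Editor's note] To illustrate the "one formula for both directions" point, here is a sketch where the same computation yields both a scale-up and a scale-down; the capacity and utilization numbers are made up.

```java
/** Illustrative parallelism computation covering both scaling directions. */
public class TargetParallelismSketch {

    static int parallelismFor(double targetRate, double capacityPerSubtask,
                              double targetUtilization) {
        // Leave headroom: aim to run subtasks at e.g. 70% of capacity, not 100%.
        return (int) Math.ceil(targetRate / (capacityPerSubtask * targetUtilization));
    }

    public static void main(String[] args) {
        // Scale up: the load outgrew what fewer subtasks at 70% can absorb.
        System.out.println(parallelismFor(2200, 1000, 0.7)); // -> 4
        // Scale down: the load dropped, and the identical formula shrinks the job.
        System.out.println(parallelismFor(500, 1000, 0.7));  // -> 1
    }
}
```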
Hi Max,
Thanks a lot for the FLIP. It is an extremely attractive feature!
Just some follow-up questions/thoughts after reading the FLIP:
In the doc, the discussion of the strategy of "scaling out" is thorough and
convincing to me, but it seems that "scaling down" is less discussed. I have
2 cen...
Thanks for preparing the FLIP and kicking off the discussion, Max. Looking
forward to this. :-)
On Sat, Nov 5, 2022 at 9:27 AM Niels Basjes wrote:
> I'm really looking forward to seeing this in action.
>
> Niels
>
> On Fri, 4 Nov 2022, 19:37 Maximilian Michels wrote:
>
>> Hi,
>>
>> I would lik...
I'm really looking forward to seeing this in action.
Niels
On Fri, 4 Nov 2022, 19:37 Maximilian Michels wrote:
> Hi,
>
> I would like to kick off the discussion on implementing autoscaling for
> Flink as part of the Flink Kubernetes operator. I've outlined an approach
> here which I find promi...
Thank you Max, Gyula!
This is definitely an exciting one :)
Cheers,
Matyas
On Fri, Nov 4, 2022 at 1:16 PM Gyula Fóra wrote:
> Hi!
>
> Thank you for the proposal Max! It is great to see this highly desired
> feature finally take shape.
>
> I think we have all the right building blocks to make t...
Hi!
Thank you for the proposal Max! It is great to see this highly desired
feature finally take shape.
I think we have all the right building blocks to make this successful.
Cheers,
Gyula
On Fri, Nov 4, 2022 at 7:37 PM Maximilian Michels wrote:
> Hi,
>
> I would like to kick off the discussio...