It is certainly true that the messaging around the AS/reactive mode wasn't good.

In part this happened because we initially intended to advertise only reactive mode, and only later realized that the AS on its own could already be useful too.

That being said, I'm not sure how to improve the elastic scaling page; can you make specific suggestions? For example, I don't see it being implied that using the adaptive scheduler enables reactive mode by default (which it doesn't). Some tight coupling is of course required because reactive mode requires the AS, but to me the page makes it clear that you can use the AS on its own.

The only thing I can think of is making reactive mode a sub-section of the AS. I may have just been too involved to really see the problems in the docs; I'd appreciate any help you can give to improve the docs.

On 27/01/2023 10:33, Gyula Fóra wrote:
Also @David Morávek <d...@apache.org> @Chesnay Schepler <ches...@apache.org>


It would be great if you could update the respective docs page before
publishing your improvement FLIPs about the adaptive scheduler:
https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/elastic_scaling/

I think much of the confusion and many of the misconceptions stem from this
docs page, as it ties the Adaptive scheduler and Reactive mode closely together.
It also clearly implies that if you use the adaptive scheduler, then
reactive mode is the default behaviour which is probably not what we want
if we want users to switch over from the current standard scheduler.

We should clearly separate the features of the Adaptive Scheduler from the
reactive mode which is tailored for very specific use cases and setups
(standalone only) and requires careful configuration before production use.
Reactive mode should be an opt-in feature.

Gyula

On Fri, Jan 27, 2023 at 10:19 AM David Morávek <d...@apache.org> wrote:

The adaptive scheduler only supports streaming jobs. That's the biggest
limitation that probably won't be fixed anytime soon.

Since FLIP-283 [1] has been accepted, I think this limitation might have
already been addressed to a certain extent. I'd be completely fine with
having a separate scheduler for batch and streaming (maybe we could build a
hybrid one at some point that automatically switches between the two).

[1]

https://cwiki.apache.org/confluence/display/FLINK/FLIP-283%3A+Use+adaptive+batch+scheduler+as+default+scheduler+for+batch+jobs


On Fri, Jan 27, 2023 at 9:58 AM Chesnay Schepler <ches...@apache.org>
wrote:

The adaptive scheduler only supports streaming jobs. That's the biggest
limitation that probably won't be fixed anytime soon.
The goal was though to make the adaptive scheduler the default for
streaming jobs eventually.
It was very much meant as a better version of the default scheduler for
streaming jobs.

On 26/01/2023 19:06, David Morávek wrote:
Hi Gyula,


can you please explain why the AdaptiveScheduler is not the default
scheduler?
There are still some smaller bits missing. As far as I know, the missing
parts are:

1) Local recovery (reusing the already downloaded state files after
restart / rescale)
2) Support for fine-grained resource management
3) Support for the session cluster (Chesnay will be submitting a FLIP for
this soon)

We're looking into addressing all of these limitations in the short term.
Personally, I'd love to start a discussion about transitioning the
AdaptiveScheduler into the default one after those limitations are fixed.
Being able to eventually deprecate and remove the DefaultScheduler would
simplify the code-base by a lot since there are many adapters between new
and old interfaces (e.g. SlotPool-related interfaces).

Best,
D.

On Thu, Jan 26, 2023 at 6:27 PM Gyula Fóra <gyula.f...@gmail.com>
wrote:
Chesnay,

Seems like you are suggesting that the Adaptive scheduler does everything
the standard scheduler does and more.

I am clearly not an expert on this topic but can you please explain why the
AdaptiveScheduler is not the default scheduler?
If it can do everything, why do we even have 2 schedulers? Why not simply
drop the "old" one?

That would probably clear up all the confusion then :)

Gyula

On Thu, Jan 26, 2023 at 6:23 PM Chesnay Schepler <ches...@apache.org>
wrote:

There's the default and reactive mode; nothing else.
At their core they are the same thing; reactive mode just cranks up the
desired parallelism to infinity and enforces certain assumptions (e.g.,
no active resource management).

The advantage is that the adaptive scheduler can run jobs while
insufficient resources are available, and scale things up again once they
are available.
This is its core functionality, but we always intended to extend it
such that users can modify the parallelism at runtime as well.
And since the AS can already rescale jobs (and was purpose-built with
that functionality in mind), this is just a matter of exposing an API
for it. Everything else is already there.
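
A minimal sketch of the behaviour described above, written in Python rather than taken from Flink's Java code. The function names and the use of `math.inf` to stand in for "infinite" desired parallelism are illustrative assumptions, not Flink APIs; the model also ignores per-operator parallelism and slot sharing.

```python
import math


def effective_parallelism(desired, available_slots):
    """Return the parallelism a job would run with, or None if it cannot run.

    Hypothetical model of the adaptive scheduler: use as many slots as are
    currently available, capped at the desired (user-provided) parallelism.
    """
    if available_slots < 1:
        return None  # nothing to run on yet; wait for resources
    return int(min(desired, available_slots))


def reactive_mode_parallelism(available_slots):
    """Model reactive mode as 'desired parallelism = infinity'.

    With an unbounded desired parallelism, the job always uses every
    slot that is currently available.
    """
    return effective_parallelism(math.inf, available_slots)
```

Under this model, a job with a desired parallelism of 4 keeps running at parallelism 3 when only 3 slots are left, and goes back to 4 once more slots appear; in reactive mode the same job would simply use all slots it is given.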

As a concrete use-case, let's say you have an SLA that says jobs must
not be down longer than X seconds, and a TM just crashed.
If you can absolutely guarantee that your k8s cluster can provision a
new TM within X seconds, no matter what cruel reality has in store for
you, then you /may/ not need it.
If you can't, well then here's a use-case for you.

> Last time I looked they implemented the same interface and the same
> base class. Of course, their behavior is quite different.

They never shared a base class since day 1. Are you maybe mixing up the
AdaptiveScheduler and AdaptiveBatchScheduler?

As for FLINK-30773, I think that should be covered.

On 26/01/2023 17:10, Maximilian Michels wrote:
Thanks for the explanation. If not for the "reactive mode", what is
the advantage of the adaptive scheduler? What other modes does it
support?

Apart from implementing the same interface the implementations of the
adaptive and default schedulers are separate.
Last time I looked they implemented the same interface and the same
base class. Of course, their behavior is quite different.

I'm still very interested in learning about the future FLIPs
mentioned. Based on the replies, I'm assuming that they will support
the changes required for
https://issues.apache.org/jira/browse/FLINK-30773, or at least provide
the basis for implementing them.

-Max

On Thu, Jan 26, 2023 at 4:57 PM Chesnay Schepler <ches...@apache.org>
wrote:
On 26/01/2023 16:18, Maximilian Michels wrote:

I see slightly different goals for the standard and the adaptive
scheduler. The adaptive scheduler's goal is to adapt the Flink job
according to the available resources.

This is really a misconception that we just have to stomp out.

This statement only applies to reactive mode, a special mode in which
the adaptive scheduler (AS) can run, where active resource management is
not supported since requesting infinite resources from k8s doesn't really
make sense.
The AS itself can work perfectly fine with active resource management,
and has no effect on how the RM talks to k8s. It can just keep the job
running in cases where fewer resources than desired (== user-provided
parallelism) are provided by k8s (possibly temporarily).
On 26/01/2023 16:18, Maximilian Michels wrote:

After all, both schedulers share the same super class

Apart from implementing the same interface the implementations of the
adaptive and default schedulers are separate.



