Thanks Yuepeng for the PR and starting this discussion!

And thanks Gyula and Yuanfeng for the input!

I also agree to fix this behaviour in the 1.x line.

The adaptive scheduler and rescaling API provide powerful capabilities to
increase or decrease parallelism.

As I understand it, the main benefit of decreasing parallelism is saving
resources. If decreasing parallelism can't save resources, why would users
decrease it?
This is why I think releasing TM resources when decreasing parallelism is
a basic capability that the Adaptive Scheduler should have.

Please correct me if I missed anything, thanks~

Also, I believe it does not work as users expect, because this behaviour
has been reported multiple times in the Flink community, such as
FLINK-33977[1], FLINK-35594[2], FLINK-35903[3] and the Slack channel[4].
And 1.20.x is an LTS version, so I agree to fix it in the 1.x line.

[1] https://issues.apache.org/jira/browse/FLINK-33977
[2] https://issues.apache.org/jira/browse/FLINK-35594
[3] https://issues.apache.org/jira/browse/FLINK-35903
[4] https://apache-flink.slack.com/archives/C03G7LJTS2G/p1729167222445569

Best,
Rui

On Wed, Nov 6, 2024 at 4:15 PM yuanfeng hu <yuanf...@apache.org> wrote:

> > Is it considered an error if the adaptive scheduler fails to release the
> > task manager during scaling?
>
> +1. When we enable adaptive mode and perform scaling operations on tasks,
> a significant part of the goal is to reduce resource usage for the tasks.
> However, due to some logic in the adaptive scheduler's scheduling process,
> the task manager cannot be released, and the ultimate goal cannot be
> achieved. Therefore, I consider this to be a mistake.
>
> Additionally, many tasks are currently running in this mode and will
> continue to run for quite a long time (many users are in this situation).
> So whether or not it is considered a bug, I believe we need to fix it in
> the 1.x version.
>
> On Wed, Nov 6, 2024 at 14:32, Yuepeng Pan <panyuep...@apache.org> wrote:
>
> > Hi, community.
> >
> > When working on ticket[1] we received some lively discussion and
> > valuable feedback[2] (thanks to Matthias, Rui, Gyula, Maximilian, Tison,
> > etc.). The main questions are:
> >
> > When the job runs in an application cluster, could the default behavior
> > of AdaptiveScheduler not actively releasing TaskManager resources during
> > downscaling be considered a bug?
> >
> > If so, should we fix it in Flink 1.x?
> >
> >
> >
> > I'd like to start a discussion to gather more comments on this and to
> > define the next step. I have collected some background information for
> > this discussion in the doc[3].
> >
> >
> >
> > Looking forward to your comments and attention.
> >
> > Thank you.
> >
> > Best,
> > Yuepeng Pan
> >
> >
> >
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-33977
> >
> > [2] https://github.com/apache/flink/pull/25218#issuecomment-2401913141
> >
> > [3] https://docs.google.com/document/d/1Rwwl2aGVz9g5kUJFMP5GMlJwzEO_a-eo4gPf7gITpdw/edit?tab=t.0#heading=h.s4i4hehbbli5
> >
> >
> >
>
> --
> Best,
> Yuanfeng
>
