Hi Gyula,

Thanks for the response, let me clarify a bit.

The issue is not that Flink’s per-task metrics, such as busyness, idleness,
or backpressure, are incorrect; they do what they are supposed to do. The
challenge comes when we try to interpret them in aggregate for autoscaling.

Right now, the autoscaler assumes that each task has access to one full
vCore. In practice:

   - With slot sharing, multiple subtasks can run on the same TaskManager,
     sharing the same vCore. Their reported busyness values do not reflect
     this contention.
   - When cgroups are disabled, TaskManagers can burst above their
     allocation, so task-level metrics look fine even though the cluster is
     oversubscribed.
   - When cgroups are enabled, the opposite happens. The autoscaler may try
     to drive tasks toward a utilization target that is not physically
     achievable because the target is set at the task level rather than the
     TaskManager level.

Minimal example


Suppose we have 2 TaskManagers, each with 1 slot, 1 vCore, and slot sharing
enabled. We run a job with 4 subtasks. Flink places 2 subtasks on each
TaskManager. For simplicity, let's assume all the subtasks have equal
processing capacity, throughput, and busyness.

   - Each task reports about 50 percent busy time. From the autoscaler’s
     perspective, this looks underutilized, since 50 percent is below the
     70 percent target.
   - At the TaskManager level, the two subtasks together are already
     saturating the single vCore, excluding any background threads. In
     reality, there is no headroom left.
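The arithmetic behind this example can be sketched in a few lines of Python
(the placement, busyness values, and 70 percent target come from the example
above; none of the names are actual autoscaler APIs):

```python
# Hypothetical minimal example: 2 TaskManagers, 1 vCore each,
# 4 subtasks placed 2 per TaskManager via slot sharing.
UTILIZATION_TARGET = 0.70  # per-task busyness target used by the autoscaler

# Busy-time fraction reported by each subtask (all equal for simplicity).
subtask_busyness = {"a1": 0.5, "a2": 0.5, "a3": 0.5, "a4": 0.5}
placement = {"tm-1": ["a1", "a2"], "tm-2": ["a3", "a4"]}

# Per-task view: every subtask is below the 70 percent target,
# so the job looks underutilized.
looks_underutilized = all(
    b < UTILIZATION_TARGET for b in subtask_busyness.values()
)

# TaskManager view: summing colocated subtasks' busyness approximates
# the demand on the single shared vCore.
tm_demand = {
    tm: sum(subtask_busyness[s] for s in subtasks)
    for tm, subtasks in placement.items()
}

print(looks_underutilized)  # True: the per-task signal says "scale down"
print(tm_demand)            # {'tm-1': 1.0, 'tm-2': 1.0}: vCores saturated
```

The two prints disagree about headroom, which is exactly the wrong-decision
scenario described above.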

In this case, the metrics are individually correct, but the autoscaler
makes the wrong decision because it interprets them per task without any
CPU awareness. The parallelism the autoscaler chooses would only be correct
if each TaskManager were vertically scaled to match the number of tasks it
runs, but no such capability exists. As a result, users cannot reason about
job efficiency in a consistent way across large and diverse workloads.

That is why I was thinking of shifting the focus to TaskManager-level
aggregation: summing colocated tasks’ loads so we scale on something closer
to real CPU usage. The current algorithm can also introduce CPU skew across
TaskManagers, since it does not converge to a configuration that uses equal
parallelism for tasks sharing the same slot; the proposed algorithm would
converge to a configuration that minimizes CPU skew. Of course, this would
require rigorous theoretical validation, but first I wanted to get some
acknowledgment from the community of the existing drawbacks.
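To make the aggregation idea concrete, here is a rough sketch in Python.
All function names, the placement, and the busyness numbers are hypothetical
illustrations, not an actual autoscaler implementation:

```python
# Sketch: aggregate colocated subtasks' loads into a per-TaskManager
# utilization signal, then scale on average and peak across TaskManagers.

def tm_utilization(busyness_by_subtask, placement, vcores_per_tm=1.0):
    """Sum colocated subtasks' busy fractions, normalized by vCores."""
    return {
        tm: sum(busyness_by_subtask[s] for s in subtasks) / vcores_per_tm
        for tm, subtasks in placement.items()
    }

def scaling_signal(utilization_by_tm):
    """Report both average and peak, so one hot TaskManager (CPU skew)
    is not hidden by idle ones."""
    values = list(utilization_by_tm.values())
    return {"avg": sum(values) / len(values), "peak": max(values)}

# Example: subtasks a1, b2, c5 colocated on tm-1; d1 alone on tm-2.
busyness = {"a1": 0.5, "b2": 0.25, "c5": 0.5, "d1": 0.25}
placement = {"tm-1": ["a1", "b2", "c5"], "tm-2": ["d1"]}

util = tm_utilization(busyness, placement)
signal = scaling_signal(util)
print(util)    # {'tm-1': 1.25, 'tm-2': 0.25}: tm-1 is over its vCore
print(signal)  # {'avg': 0.75, 'peak': 1.25}: avg hides the skew peak shows
```

The avg-versus-peak split is one way to surface the skew problem; which of
the two should drive scaling decisions is exactly the kind of question that
would need the theoretical validation mentioned above.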

Cheers,
Sharath

On Tue, Sep 2, 2025 at 6:05 AM Gyula Fóra <[email protected]> wrote:

> Hey!
>
> I am having a bit of trouble understanding the problem with the reported
> metrics. I assume if something is off that means that the Flink reported
> busyness/idleness/backpressure metrics must be off for a given job.
>
> Could you please give us a small/minimal example of how this would be the
> case?
> Would it make sense to try to fix the problem with the Flink metrics
> themselves instead of coming up with strategies on the autoscaler side to
> work around the incorrect metrics?
>
> I think the autoscaler algorithm is sound but it assumes fundamentally
> correct metrics, so fixing those may be the simpler solution.
>
> Cheers
> Gyula
>
> On Fri, Aug 22, 2025 at 1:02 AM Sharath <[email protected]> wrote:
>
> > Hi all,
> >
> > I’ve been looking into how the autoscaler behaves with jobs that have a
> > large number of tasks and wanted to share some thoughts to start a
> > discussion.
> > The problem
> >
> > Right now, the autoscaler implicitly assumes that each task gets a full
> > second of processing time. While this works in simple cases where there
> is
> > only one task, it breaks down in other situations.
> >
> > If cgroups are not enabled on YARN, TaskManagers can use much more than
> > their allocated 1 vCore, leading to overallocation. When cgroups are
> > enabled, the autoscaler may try to maintain a target utilization that is
> > not realistically achievable due to CPU constraints. Backlog-based
> scaling
> > might keep the job running without lag, but the underlying behaviour is
> > hard to reason about.
> >
> > Another challenge is that the target utilization metric does not reflect
> > actual CPU usage. This makes it hard for users to understand how
> efficient
> > their job really is or to tune the utilization threshold effectively
> across
> > a large number of jobs. I believe this is a fundamental limitation of the
> > original research paper
> > <https://www.usenix.org/system/files/osdi18-kalavri.pdf> that the Flink
> > autoscaler is based on. In a way, the research paper implicitly assumes
> > that slot-sharing
> > <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/finegrained_resource/>
> > is disabled in Flink.
> > Potential Solution
> >
> > One idea is to make the autoscaler aware of how tasks are placed and
> shift
> > the focus to TaskManager-level utilization instead of task-level.
> >
> > For example, if sub-tasks a1, b2, and c5 are colocated on the same
> > TaskManager, their individual loads should be summed to represent the
> total
> > pressure on that TaskManager. The autoscaler could then scale based on
> the
> > average or peak pressure across all TaskManagers rather than treating
> tasks
> > in isolation.
> >
> > A possible approach would be:
> >
> >    - Use existing metrics like busy time and backpressure to estimate
> >      per-task load.
> >    - Group colocated tasks and aggregate their loads by TaskManager.
> >    - Calculate TaskManager-level utilization and use that as the signal
> >      for scaling.
> > This approach will not capture everything, such as background threads or
> > garbage collection, but it should be a step closer to aligning with real
> > CPU usage patterns.
> > Next steps
> >
> > I have not worked out all the implementation details yet, and I admit
> this
> > approach adds significantly more complexity than I'd like. But given how
> > Flink places and runs tasks with slot sharing, I think the extra
> complexity
> > might be necessary.
> >
> > I would love to get feedback from others before going further with a
> design
> > doc. Let me know what you think.
> >
> > Thanks,
> > Sharath
> >
>
