Trystan created FLINK-35285:
-------------------------------
Summary: Autoscaler key group optimization can interfere with
scale-down.max-factor
Key: FLINK-35285
URL: https://issues.apache.org/jira/browse/FLINK-35285
Project: Flink
Issue Type: Bug
Reporter: Trystan
When a less aggressive scale-down limit is set, the key group optimization can
prevent a vertex from scaling down at all. It searches upward from the target
parallelism to maxParallelism/2 for a divisor of maxParallelism, and will keep
finding the current parallelism again.
A simple test trying to scale down from a parallelism of 60 with a
scale-down.max-factor of 0.2:
{code:java}
assertEquals(48, JobVertexScaler.scale(60, inputShipStrategies, 360, .8, 8, 360));
{code}
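To make the arithmetic concrete, here is a minimal standalone sketch of the upward search as described above. The class and method names are hypothetical and this is not the operator's actual code; the fallback after the loop is also an assumption. With a target of 48 (60 * 0.8) and maxParallelism of 360, no value in 48..59 divides 360, so the first divisor found is 60, the current parallelism.

```java
class KeyGroupHuntDemo {
    // Hypothetical stand-in for the key group search described in this report.
    // It walks upward from the target and returns the first divisor of maxParallelism.
    public static int huntUpward(int newParallelism, int maxParallelism, int upperBound) {
        for (int p = newParallelism; p <= maxParallelism / 2 && p <= upperBound; p++) {
            if (maxParallelism % p == 0) {
                return p;
            }
        }
        // Assumed fallback: keep the originally computed target if no divisor is found.
        return newParallelism;
    }

    public static void main(String[] args) {
        int target = 48; // 60 * 0.8, the most a scale-down.max-factor of 0.2 allows
        // Divisors of 360 near the target are 45 and 60, so the search skips 48..59.
        System.out.println(huntUpward(target, 360, 360)); // prints 60
    }
}
```

The result equals the starting parallelism, so each scaling attempt would be a no-op under these assumptions.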
It seems reasonable to make a good attempt to spread data across subtasks, but
not at the expense of total deadlock. The problem is that during scale-down it
doesn't actually ensure that newParallelism will be < currentParallelism.
The change below is clunky, but it ensures the scaler can make at least some
progress. Another test now fails with it, but it illustrates the point:
{code:java}
for (int p = newParallelism; p <= maxParallelism / 2 && p <= upperBound; p++) {
    if ((scaleFactor < 1 && p < currentParallelism)
            || (scaleFactor > 1 && p > currentParallelism)) {
        if (maxParallelism % p == 0) {
            return p;
        }
    }
}
{code}
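As a rough check of the guarded loop, the sketch below wraps it in a hypothetical helper (the class name, method name, and parameter order are made up for illustration). It assumes that when no candidate qualifies, the caller falls back to the originally computed newParallelism; that fallback is an assumption here, not something confirmed in this report.

```java
class GuardedKeyGroupHunt {
    // Hypothetical wrapper around the guarded loop proposed above.
    public static int adjust(
            int currentParallelism,
            int newParallelism,
            double scaleFactor,
            int maxParallelism,
            int upperBound) {
        for (int p = newParallelism; p <= maxParallelism / 2 && p <= upperBound; p++) {
            // Only accept candidates that move in the requested direction.
            boolean makesProgress =
                    (scaleFactor < 1 && p < currentParallelism)
                            || (scaleFactor > 1 && p > currentParallelism);
            if (makesProgress && maxParallelism % p == 0) {
                return p;
            }
        }
        // Assumed fallback: use the unadjusted target rather than stalling.
        return newParallelism;
    }
}
```

Under these assumptions, adjust(60, 48, 0.8, 360, 360) yields 48 instead of 60, because no divisor of 360 lies in 48..59 and the guard rejects 60. A scale-up such as adjust(60, 70, 1.2, 360, 360) still snaps to the next divisor, 72.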
Perhaps this is by design and not a bug, but total failure to scale down in
order to keep optimized key groups does not seem ideal.
Key group optimization block:
https://github.com/apache/flink-kubernetes-operator/blob/fe3d24e4500d6fcaed55250ccc816546886fd1cf/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/JobVertexScaler.java#L296C1-L303C10