[ https://issues.apache.org/jira/browse/FLINK-35285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868093#comment-17868093 ]
Trystan edited comment on FLINK-35285 at 7/23/24 3:03 PM:
----------------------------------------------------------

{noformat}As long as your job parallelism is very small compared to the max parallelism and we have a lot of divisors the algorithm has a lot of flexibility even with small scale factors.{noformat}

Yes, I agree this makes sense. Pairing it with a vertex max and a high overall max-parallelism could essentially trick the current algorithm into working. I would argue that a current parallelism of 40 is not very close to the max parallelism of 120, though. Maybe our patterns are outside the norm? But to me this seems well within a "normal" range.

Is there any reason why we wouldn't want to adjust the algorithm? To my eyes, it has a flaw: when a scale is _requested_, it may not _actually_ scale, because it does not take the current bounds into account, i.e.

{noformat}On scale down, ensure that p < currentParallelism, and on scale up that p > currentParallelism.{noformat}

Without such a check, it is very likely that the loop in question will find p == currentParallelism, and then maxParallelism % p == 0 will return true, resulting in no action being taken. Looking at the goals of the algorithm, it seems designed to _try its best_ to find a p such that [max % p == 0|https://github.com/apache/flink-kubernetes-operator/blob/fe3d24e4500d6fcaed55250ccc816546886fd1cf/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/JobVertexScaler.java#L296-L303], but if it fails it should still return p ([here|https://github.com/apache/flink-kubernetes-operator/blob/fe3d24e4500d6fcaed55250ccc816546886fd1cf/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/JobVertexScaler.java#L305-L306]). I think a simple check ensuring that p != currentParallelism in the keygroup optimization loop could let it optimize without deadlocking. Or perhaps I'm misunderstanding the goal.
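To make the deadlock concrete, here is a minimal, self-contained sketch of the behavior described above. The class and method names are hypothetical (this is not the actual JobVertexScaler code); it only assumes, per the linked lines, that the search walks upward from the requested target and returns the first divisor of maxParallelism it finds. The second method shows the proposed direction-aware bound.

{code:java}
// Hypothetical sketch of the key group alignment loop, not the real
// JobVertexScaler implementation.
public class KeyGroupAlignmentSketch {

    // Current behavior: return the first divisor of maxParallelism found
    // while walking upward from the requested target parallelism.
    static int alignUnbounded(int target, int maxParallelism) {
        for (int p = target; p <= maxParallelism / 2; p++) {
            if (maxParallelism % p == 0) {
                return p;
            }
        }
        return target; // best effort failed; keep the requested parallelism
    }

    // Proposed behavior: only accept a divisor that respects the requested
    // scale direction, so a scale-down can never resolve back to the
    // current parallelism.
    static int alignBounded(int target, int currentParallelism, int maxParallelism) {
        boolean scalingDown = target < currentParallelism;
        for (int p = target; p <= maxParallelism / 2; p++) {
            boolean directionOk =
                    scalingDown ? p < currentParallelism : p > currentParallelism;
            if (directionOk && maxParallelism % p == 0) {
                return p;
            }
        }
        return target;
    }

    public static void main(String[] args) {
        // Scale down from 60 with scale-down.max-factor 0.2: target = 48.
        // The unbounded walk 48, 49, ... first hits a divisor of 360 at 60,
        // i.e. the current parallelism, so no scale-down ever happens.
        System.out.println(alignUnbounded(48, 360)); // prints 60

        // The bounded variant finds no divisor of 360 in [48, 60), so it
        // falls back to the requested 48, matching assertEquals(48, ...)
        // from the issue description.
        System.out.println(alignBounded(48, 60, 360)); // prints 48
    }
}
{code}

On scale up the bounded variant still optimizes normally, e.g. a request for 70 against a max of 360 would land on 72, the next divisor above the current 60.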
I would be happy to send a PR over with a slightly tweaked algorithm if you're open to adjusting this.
> Autoscaler key group optimization can interfere with scale-down.max-factor
> --------------------------------------------------------------------------
>
>                 Key: FLINK-35285
>                 URL: https://issues.apache.org/jira/browse/FLINK-35285
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>            Reporter: Trystan
>            Priority: Minor
>
> When setting a less aggressive scale down limit, the key group optimization
> can prevent a vertex from scaling down at all. It will hunt from the target
> upwards to maxParallelism/2, and will always find currentParallelism again.
>
> A simple test trying to scale down from a parallelism of 60 with a
> scale-down.max-factor of 0.2:
> {code:java}
> assertEquals(48, JobVertexScaler.scale(60, inputShipStrategies, 360, .8, 8, 360));
> {code}
>
> It seems reasonable to make a good attempt to spread data across subtasks,
> but not at the expense of total deadlock. The problem is that during scale
> down it doesn't actually ensure that newParallelism will be <
> currentParallelism. The only workaround is to set a scale down factor large
> enough such that it finds the next lowest divisor of the maxParallelism.
>
> Clunky, but something to ensure it can make at least some progress. There is
> another test that now fails, but just to illustrate the point:
> {code:java}
> for (int p = newParallelism; p <= maxParallelism / 2 && p <= upperBound; p++) {
>     if ((scaleFactor < 1 && p < currentParallelism)
>             || (scaleFactor > 1 && p > currentParallelism)) {
>         if (maxParallelism % p == 0) {
>             return p;
>         }
>     }
> }
> {code}
>
> Perhaps this is by design and not a bug, but a total failure to scale down in
> order to keep optimized key groups does not seem ideal.
>
> Key group optimization block:
> [https://github.com/apache/flink-kubernetes-operator/blob/fe3d24e4500d6fcaed55250ccc816546886fd1cf/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/JobVertexScaler.java#L296C1-L303C10]

-- This message was sent by Atlassian Jira (v8.20.10#820010)