[ 
https://issues.apache.org/jira/browse/FLINK-35285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868093#comment-17868093
 ] 

Trystan edited comment on FLINK-35285 at 7/23/24 3:00 PM:
----------------------------------------------------------

{noformat}
As long as your job parallelism is very small compared to the max parallelism 
and we have a lot of divisors the algorithm has a lot of flexibility even with 
small scale factors. {noformat}
Yes, I agree this makes sense. Pairing it with vertex max and a high overall 
max-parallelism could essentially trick the current algo into working.

I would argue that a current parallelism of 40 is not very close to the max 
parallelism of 120, though. Maybe our patterns are outside the norm, but to me 
this seems well within a "normal" range.

 

Is there any reason why we wouldn't want to adjust the algorithm? To my eyes, 
it has a flaw in that when a scale is _requested_ it may not _actually_ scale 
because it does not take into account the current bounds, i.e.
{noformat}
On scale down, ensure that p < currentParallelism and on scale up p > 
currentParallelism.{noformat}
Without such a check, it is very likely that the loop in question will find p 
== currentParallelism, maxParallelism % p == 0 will return true, and no action 
will be taken. Looking at the goals of the algorithm, it seems designed to 
_try its best_ to find a p such that [max % p == 
0|https://github.com/apache/flink-kubernetes-operator/blob/fe3d24e4500d6fcaed55250ccc816546886fd1cf/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/JobVertexScaler.java#L296-L303],
 but if it fails it should still return p. I think a simple check ensuring 
that p != currentParallelism could let it optimize without deadlocking.
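
To make that concrete, here is a minimal, self-contained sketch of the guard I 
have in mind (class, method, and parameter names are mine for illustration, 
not the operator's actual API):
{code:java}
// Sketch only: a divisor hunt that refuses to land back on the current
// parallelism, and falls back to the requested parallelism when no divisor
// of maxParallelism fits the requested direction. Names are illustrative.
public final class KeyGroupAlignmentSketch {

    static int align(
            int currentParallelism,
            int newParallelism,
            int maxParallelism,
            int upperBound,
            double scaleFactor) {
        for (int p = newParallelism; p <= maxParallelism / 2 && p <= upperBound; p++) {
            boolean movesInRequestedDirection =
                    (scaleFactor < 1 && p < currentParallelism)
                            || (scaleFactor > 1 && p > currentParallelism);
            if (movesInRequestedDirection && maxParallelism % p == 0) {
                return p;
            }
        }
        // No divisor satisfies the bounds: return the requested parallelism
        // rather than silently keeping currentParallelism.
        return newParallelism;
    }

    public static void main(String[] args) {
        // Scale down from 60 with factor 0.8 and maxParallelism 360: the plain
        // divisor hunt lands back on 60 (a no-op), while this sketch returns 48.
        System.out.println(align(60, 48, 360, 360, 0.8));
    }
}
{code}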

 

Or perhaps I'm misunderstanding the goal. I would be happy to send over a PR 
with a slightly tweaked algorithm if you're open to adjusting this.



> Autoscaler key group optimization can interfere with scale-down.max-factor
> --------------------------------------------------------------------------
>
>                 Key: FLINK-35285
>                 URL: https://issues.apache.org/jira/browse/FLINK-35285
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>            Reporter: Trystan
>            Priority: Minor
>
> When setting a less aggressive scale down limit, the key group optimization 
> can prevent a vertex from scaling down at all. It will hunt from target 
> upwards to maxParallelism/2, and will always find currentParallelism again.
>  
> A simple test trying to scale down from a parallelism of 60 with a 
> scale-down.max-factor of 0.2:
> {code:java}
> assertEquals(48, JobVertexScaler.scale(60, inputShipStrategies, 360, .8, 8, 360)); {code}
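>  
> Walking through those numbers (assuming the loop behaves like the linked 
> JobVertexScaler code): the requested target is 60 * 0.8 = 48, but 360 % 48 != 0, 
> so the loop hunts upward from 48; the first divisor of 360 it reaches is 60, the 
> current parallelism itself, so no scale-down happens.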
>  
> It seems reasonable to make a good attempt to spread data across subtasks, 
> but not at the expense of total deadlock. The problem is that during scale 
> down it doesn't actually ensure that newParallelism will be < 
> currentParallelism. The only workaround is to set a scale down factor large 
> enough such that it finds the next lowest divisor of the maxParallelism.
>  
> Clunky, but it is something that ensures at least some progress can be made. 
> There is another test that now fails with the change below, but it illustrates 
> the point:
> {code:java}
> for (int p = newParallelism; p <= maxParallelism / 2 && p <= upperBound; p++) {
>     if ((scaleFactor < 1 && p < currentParallelism)
>             || (scaleFactor > 1 && p > currentParallelism)) {
>         if (maxParallelism % p == 0) {
>             return p;
>         }
>     }
> } {code}
>  
> Perhaps this is by design and not a bug, but total failure to scale down in 
> order to keep optimized key groups does not seem ideal.
>  
> Key group optimization block:
> [https://github.com/apache/flink-kubernetes-operator/blob/fe3d24e4500d6fcaed55250ccc816546886fd1cf/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/JobVertexScaler.java#L296C1-L303C10]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
