[ 
https://issues.apache.org/jira/browse/YUNIKORN-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18035159#comment-18035159
 ] 

Jian Chen commented on YUNIKORN-3148:
-------------------------------------

There's another issue in `GetQueueOutstandingRequests` that could lead to 
over-provisioning by the underlying cluster auto-scaler. 
{code:java}
func (sq *Queue) GetQueueOutstandingRequests(total *[]*Allocation) {
    if sq.IsLeafQueue() {
        //
        // issue discussed in this JIRA
        //
    } else {
        for _, child := range sq.sortQueues() {
            child.GetQueueOutstandingRequests(total)
        }
    }
}
{code}
Each child collects its own outstanding requests independently, without any 
coordination between siblings. This can lead to overcounting the available 
headroom under a parent queue. For example: 
{code:java}
parent_queue:
  allocated: 40TB
  max: 60TB
  headroom: 60 - 40 = 20TB

child_1:
  allocated: 20TB
  max: 50TB
  headroom: min(30TB, parent_headroom) = 20TB

child_2:
  allocated: 20TB
  max: 50TB
  headroom: min(30TB, parent_headroom) = 20TB

The total headroom at this point is 40TB (> 20TB at the parent level){code}
This implies that a cluster can potentially autoscale on the sum of the max 
capacities of all queues, bypassing the parent queue's max boundary (although 
that extra capacity will not be used). 
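
To make the numbers above concrete, here is a minimal, self-contained Go sketch. It is not YuniKorn code: the {{queue}} type, {{independentHeadroom}}, {{sharedHeadroom}}, {{exampleTree}} and {{unlimited}} are hypothetical names, resources are reduced to a single integer in TB, and a leaf is assumed to claim its entire headroom (the worst case). The second function shows one possible way to coordinate siblings through a shared budget, offered only as an illustration.

```go
package main

import "fmt"

// unlimited is a hypothetical stand-in for "no limit above the root".
const unlimited = int64(1) << 62

// queue is a hypothetical, simplified queue; quantities are in TB.
type queue struct {
	name      string
	allocated int64
	max       int64
	children  []*queue
}

func minInt64(a, b int64) int64 {
	if a < b {
		return a
	}
	return b
}

// independentHeadroom mirrors the uncoordinated traversal: each child is
// capped by the parent headroom, but siblings do not see each other's claims.
func independentHeadroom(q *queue, parentHeadroom int64) int64 {
	own := minInt64(q.max-q.allocated, parentHeadroom)
	if len(q.children) == 0 {
		return own
	}
	var total int64
	for _, c := range q.children {
		total += independentHeadroom(c, own)
	}
	return total
}

// sharedHeadroom passes one remaining budget down the tree; whatever a leaf
// claims is deducted before the next sibling is visited.
func sharedHeadroom(q *queue, budget *int64) int64 {
	own := minInt64(q.max-q.allocated, *budget)
	if len(q.children) == 0 {
		*budget -= own
		return own
	}
	local := own
	var total int64
	for _, c := range q.children {
		total += sharedHeadroom(c, &local)
	}
	*budget -= total
	return total
}

// exampleTree builds the parent_queue / child_1 / child_2 example.
func exampleTree() *queue {
	return &queue{name: "parent_queue", allocated: 40, max: 60, children: []*queue{
		{name: "child_1", allocated: 20, max: 50},
		{name: "child_2", allocated: 20, max: 50},
	}}
}

func main() {
	parent := exampleTree()
	fmt.Println("independent:", independentHeadroom(parent, unlimited))
	budget := unlimited
	fmt.Println("shared:", sharedHeadroom(parent, &budget))
}
```

With the example tree, the independent traversal reports 40TB while the shared-budget traversal stays at the parent's 20TB.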

I attached a unit test case for this:

[^queue_outstanding_requests_test.go]

 

> Incorrect headroom calculation when collecting outstanding requests
> -------------------------------------------------------------------
>
>                 Key: YUNIKORN-3148
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3148
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: core - scheduler
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Blocker
>         Attachments: queue_outstanding_requests_test.go
>
>
>  YUNIKORN-2794 introduced a bug in the Application code. We're no longer 
> mutating the {{headRoom}} object in-place in 
> {{Application.getOutstandingRequests()}}; instead, we re-assign its value 
> to a newly created object.
> Before:
> {noformat}
> headRoom.SubOnlyExisting(request.GetAllocatedResource())
> userHeadRoom.SubOnlyExisting(request.GetAllocatedResource()) {noformat}
> After:
> {noformat}
> headRoom = resources.SubOnlyExisting(headRoom, request.GetAllocatedResource())
> userHeadRoom = resources.SubOnlyExisting(userHeadRoom, 
> request.GetAllocatedResource())  {noformat}
> Problem is, this does not change the object pointed to outside the function, 
> so every iteration in {{Queue.GetQueueOutstandingRequests()}} starts with the 
> original {{headRoom}} value for every application. This leads to undesired 
> behavior, because we'll end up collecting more asks than needed, which in 
> turn triggers unnecessary cluster upscale.
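
The pointer behavior described in the quoted issue can be reproduced in isolation. The sketch below is not the YuniKorn implementation: {{resource}}, {{subInPlace}}, {{sub}}, {{collectReassign}}, {{collectInPlace}}, {{totalCollected}} and {{exampleApps}} are hypothetical names, and a resource is reduced to a single memory value. It only illustrates that re-assigning a pointer parameter inside a function leaves the caller's object untouched, whereas mutating through the pointer does not.

```go
package main

import "fmt"

// resource is a hypothetical stand-in for the scheduler's resource type.
type resource struct{ memory int64 }

// subInPlace mutates the receiver, so every holder of the pointer sees it.
func (r *resource) subInPlace(o *resource) { r.memory -= o.memory }

// sub returns a new object and leaves both inputs unchanged.
func sub(a, b *resource) *resource {
	return &resource{memory: a.memory - b.memory}
}

// collectReassign rebinds the local pointer: the caller's headRoom object
// is never modified, so the next application starts from the full value.
func collectReassign(headRoom *resource, asks []*resource) int {
	collected := 0
	for _, ask := range asks {
		if ask.memory > headRoom.memory {
			continue
		}
		headRoom = sub(headRoom, ask)
		collected++
	}
	return collected
}

// collectInPlace mutates the shared object, so the reduction carries over
// to the next application.
func collectInPlace(headRoom *resource, asks []*resource) int {
	collected := 0
	for _, ask := range asks {
		if ask.memory > headRoom.memory {
			continue
		}
		headRoom.subInPlace(ask)
		collected++
	}
	return collected
}

// totalCollected plays the role of the per-queue loop over applications.
func totalCollected(collect func(*resource, []*resource) int, headRoom int64, apps [][]*resource) int {
	shared := &resource{memory: headRoom}
	total := 0
	for _, asks := range apps {
		total += collect(shared, asks)
	}
	return total
}

// exampleApps returns two applications with two asks of 6 units each.
func exampleApps() [][]*resource {
	return [][]*resource{
		{{memory: 6}, {memory: 6}},
		{{memory: 6}, {memory: 6}},
	}
}

func main() {
	fmt.Println("reassign:", totalCollected(collectReassign, 10, exampleApps()))
	fmt.Println("in place:", totalCollected(collectInPlace, 10, exampleApps()))
}
```

With a headroom of 10 units, the re-assigning variant collects one ask per application (12 units in total), while the in-place variant collects a single ask overall.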



