[jira] [Commented] (YUNIKORN-3092) Reservations can permanently block nodes, leading to preemption failure and a stuck scheduler state

Weiwei Yang (Jira) Tue, 24 Jun 2025 23:42:44 -0700


    [ 
https://issues.apache.org/jira/browse/YUNIKORN-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17986019#comment-17986019
 ]


Weiwei Yang commented on YUNIKORN-3092:
---------------------------------------

{quote}
In other words, after DISABLE_RESERVATION="true", we still hit the log line 
"reservation added successfully". This looks to be 1:1 with preemptions, so I 
guess a pod that preempts still gets a reservation? I'm curious why that is. 
Are we trying to guarantee that the preemptor can schedule to the node it 
cleared? 
{quote}

That is correct. To make sure the preemptor gets the resource, it 
reserves a spot on a node. That is why you still see that log line. The 
reservation is triggered 
[here|https://github.com/apache/yunikorn-core/blob/1a07ad75fba01986c3cd56ce97b9eef3d1876ae0/pkg/scheduler/objects/preemption.go#L658],
 which is in "tryPreempt". What we have disabled is the reservation made during the 
"tryAllocate" phase: when the scheduler can't find a slot for the 
pod in a normal scheduling cycle, it does not reserve a node for it and just keeps 
trying. 
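
The split between the two paths can be illustrated with a small model. This is a sketch under stated assumptions, not the yunikorn-core implementation: the types `node` and `scheduler`, the field `disableReservation`, and the returned strings are all hypothetical, and only the names "tryAllocate", "tryPreempt" and the DISABLE_RESERVATION setting come from this thread.

```go
package main

import "fmt"

// node is a hypothetical, stripped-down stand-in for a scheduler node.
type node struct {
	name       string
	free       int    // free capacity, in arbitrary units
	victims    int    // capacity held by preemptable low priority pods
	reservedBy string // allocation key holding the reservation, "" if none
}

// scheduler is a hypothetical model. disableReservation mirrors the effect of
// DISABLE_RESERVATION="true" as described in the comment above.
type scheduler struct {
	disableReservation bool
	nodes              []*node
}

// tryAllocate models the normal scheduling path: allocate if something fits,
// otherwise reserve a node unless reservations are disabled.
func (s *scheduler) tryAllocate(key string, need int) string {
	for _, n := range s.nodes {
		if (n.reservedBy == "" || n.reservedBy == key) && n.free >= need {
			n.free -= need
			n.reservedBy = ""
			return "allocated on " + n.name
		}
	}
	if s.disableReservation {
		return "no fit, not reserving"
	}
	for _, n := range s.nodes {
		if n.reservedBy == "" {
			n.reservedBy = key
			return "reserved " + n.name
		}
	}
	return "no fit, nothing to reserve"
}

// tryPreempt models the preemption path: once victims are found, the node is
// reserved for the preemptor regardless of disableReservation.
func (s *scheduler) tryPreempt(key string, need int) string {
	for _, n := range s.nodes {
		if n.reservedBy == "" && n.free+n.victims >= need {
			n.reservedBy = key
			return "victims chosen, reserved " + n.name
		}
	}
	return "no preemption possible"
}

func main() {
	s := &scheduler{
		disableReservation: true,
		nodes:              []*node{{name: "node-1", victims: 4}},
	}
	fmt.Println(s.tryAllocate("high-1", 2)) // no fit, not reserving
	fmt.Println(s.tryPreempt("high-1", 2))  // victims chosen, reserved node-1
}
```

In this model the flag only gates the fallback at the end of tryAllocate, which matches the observation that "reservation added successfully" still appears 1:1 with preemptions.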



> Reservations can permanently block nodes, leading to preemption failure and a 
> stuck scheduler state
> ---------------------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-3092
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3092
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>    Affects Versions: 1.6.3
>            Reporter: John Daciuk
>            Priority: Minor
>              Labels: preemption
>             Fix For: 1.6.3
>
>         Attachments: Screenshot 2025-06-23 at 8.28.55 PM.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h2. Context
> Since deploying Yunikorn back in October 2024 we've encountered occasional 
> preemption misses. We find a high priority pod pending for hours, manually 
> delete a low priority pod, then see the high priority pod schedule. 
> We dug into this earlier this year and found the upgrade from 1.5.2 to 1.6.2 
> helpful. In particular, [this 
> PR|https://github.com/apache/yunikorn-core/pull/1001] achieved 100% expected 
> preemption in our testing due to its reservation removal logic. However, we 
> still find that 1.6.2 is not reliable with respect to preemption in practice, 
> and we can repro preemption misses by scaling up our original preemption load 
> test by 4x.
> h2. Repro
> With Yunikorn 1.6.3 schedule ~400 low priority pods that live forever and 
> fill up all node capacity. Once they are running, schedule the same number of 
> high priority pods to a different queue. Use the same resources for all the 
> pods. 
> We expect that all the high priority pods will eventually schedule. However 
> we find about 10% of them stuck pending. This can be seen in the screenshot 
> attached, where the high priority pods are tier0.
> If we add logging like in [this diff from 
> branch-1.6|https://github.com/apache/yunikorn-core/compare/branch-1.6...jdaciuk:yunikorn-core:jdaciuk-1.6]
>  we see
> {quote}{{2025-06-23T04:54:26.776Z    INFO    core.scheduler.preemption    
> objects/preemption.go:93    Removing node ip-100-76-60-239.ec2.internal from 
> consideration. node.IsReserved: true, node reservations: 
> map[847f05e1-f74c-403a-8033-154cd76d89c0:ip-100-76-60-239.ec2.internal -> 
> tier0-1-395-157140|847f05e1-f74c-403a-8033-154cd76d89c0], node fits ask: true 
>    {"applicationID": "tier0-1-406-328120", "allocationKey": 
> "e589c683-faf1-4793-97b8-c5f3b3bc34b5", "author": "MLP"}}}
> {quote}
> A node (with tons of potential victims) ip-100-76-60-239.ec2.internal is 
> removed from consideration for preemption because it's reserved. Looking at 
> the reservation map above, we see that pod tier0-1-395 has the reservation.
> The pod tier0-1-395 is blocking the entire node. Why can't it schedule and 
> release the reservation?
> {quote}{{2025-06-23T04:43:45.942Z    INFO    core.scheduler.application    
> objects/application.go:1008    tryAllocate did not find a candidate 
> allocation in the node iterator, allowPreemption: true, 
> preemptAttemptsRemaining: 0    {"applicationID": "tier0-1-395-157140", 
> "author": "MLP"}}}
> {quote}
> It cannot, because there are no more preemption attempts allowed for this 
> queue in this cycle. Unfortunately, this situation repeats every cycle, since 
> pod tier0-1-395 is never among the first in the queue to tryAllocate.
> h2. Thoughts
> This is one example, but there are a number of ways we can get stuck in such 
> a cycle. Particular to the preemption failure here, it seems like we need 
> some way to either remove the dead reservation or ignore it while considering 
> preemption victims.
> So for example, when we iterate through the nodes in [this 
> code|https://github.com/apache/yunikorn-core/blob/master/pkg/scheduler/objects/preemption.go#L163],
>  perhaps we could first try with filtering out reserved nodes (as the code 
> is) then try another loop ignoring and/or breaking reservations if we find 
> victims. Would ignoring the reservation be enough, or do we have to delete it 
> for the preemption to then result in scheduling?
> We'd love to get feedback on the following:
>  * Is passing a test like described above even a goal of Yunikorn preemption? 
>  * If so, how can we be more strategic about releasing reservations that 
> become major blockers, esp. in the preemption context?
>  * We don't suppose there's a simple way to opt out of the reservation 
> feature altogether is there? We don't ever want a reservation to block a 
> node. If the pod can't schedule in the current cycle, we'd like it to wait 
> without a reservation (in our case a full node will always free up at some 
> point all at once). Or is there something we're misunderstanding that makes 
> us need reservations?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

