[jira] [Updated] (YUNIKORN-3092) Reservations can permanently block nodes, leading to preemption failure and a stuck scheduler state

John Daciuk (Jira) Mon, 23 Jun 2025 21:20:38 -0700


     [ 
https://issues.apache.org/jira/browse/YUNIKORN-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


John Daciuk updated YUNIKORN-3092:
----------------------------------
    Description: 
h2. Context

Since deploying YuniKorn in October 2024 we've encountered occasional 
preemption misses: a high priority pod stays pending for hours, we manually 
delete a low priority pod, and the high priority pod then schedules. 

We dug into this earlier this year and found the upgrade from 1.5.2 to 1.6.2 
helpful. In particular, [this 
PR|https://github.com/apache/yunikorn-core/pull/1001] achieved 100% expected 
preemption in our testing due to its reservation removal logic. However, we 
still find that 1.6.2 is not reliable with respect to preemption in practice, 
and we can reproduce preemption misses by scaling up our original preemption 
load test by 4x.
h2. Repro

With YuniKorn 1.6.3, schedule ~400 low priority pods that run indefinitely and 
fill up all node capacity. Once they are running, schedule the same number of 
high priority pods to a different queue. Use the same resources for all the 
pods. 

We expect that all the high priority pods will eventually schedule. However, 
we find about 10% of them stuck pending. This can be seen in the attached 
screenshot, where the high priority pods are tier0.

If we add logging as in [this diff from 
branch-1.6|https://github.com/apache/yunikorn-core/compare/branch-1.6...jdaciuk:yunikorn-core:jdaciuk-1.6],
 we see:
{quote}{{2025-06-23T04:54:26.776Z    INFO    core.scheduler.preemption    
objects/preemption.go:93    Removing node ip-100-76-60-239.ec2.internal from 
consideration. node.IsReserved: true, node reservations: 
map[847f05e1-f74c-403a-8033-154cd76d89c0:ip-100-76-60-239.ec2.internal -> 
tier0-1-395-157140|847f05e1-f74c-403a-8033-154cd76d89c0], node fits ask: true   
 \{"applicationID": "tier0-1-406-328120", "allocationKey": 
"e589c683-faf1-4793-97b8-c5f3b3bc34b5", "author": "MLP"}}}
{quote}
The node ip-100-76-60-239.ec2.internal, which has plenty of potential victims, 
is removed from consideration for preemption because it is reserved. The 
reservation map above shows that pod tier0-1-395 holds the reservation.

Pod tier0-1-395 is therefore blocking the entire node. Why can't it schedule 
and release the reservation?
{quote}{{2025-06-23T04:43:45.942Z    INFO    core.scheduler.application    
objects/application.go:1008    tryAllocate did not find a candidate allocation 
in the node iterator, allowPreemption: true, preemptAttemptsRemaining: 0    
\{"applicationID": "tier0-1-395-157140", "author": "MLP"}}}
{quote}
Because there are no preemption attempts left for that particular queue in 
this cycle. Unfortunately, the situation repeats every cycle, since pod 
tier0-1-395 is never among the first in the queue to reach tryAllocate.
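
To make the starvation pattern concrete, here is a toy model of it in Go. It 
is only a sketch of the behaviour described above, not YuniKorn code: the 
{{ask}} type, {{runCycle}}, {{simulate}} and the fixed queue order are all 
hypothetical, and the per-cycle budget stands in for 
preemptAttemptsRemaining. The model assumes a single node with victims that 
is reserved by one ask, and that every preemption attempt consumes budget 
even when it fails.

```go
package main

import "fmt"

// ask is a hypothetical stand-in for a pending high priority pod.
type ask struct {
	name      string
	holdsNode bool // true if this ask holds the reservation on the only node with victims
	scheduled bool
}

// runCycle models one scheduling cycle. Asks are tried in queue order and
// each preemption attempt consumes one unit of the per-cycle budget.
// It returns the number of asks scheduled in this cycle.
func runCycle(queue []*ask, budget int) int {
	scheduled := 0
	for _, a := range queue {
		if a.scheduled {
			continue
		}
		if budget == 0 {
			// Models preemptAttemptsRemaining: 0, so no preemption is tried.
			continue
		}
		budget--
		if a.holdsNode {
			// Only the reservation holder may use the reserved node.
			a.scheduled = true
			scheduled++
		}
		// Every other ask sees the node filtered out as reserved and fails.
	}
	return scheduled
}

// simulate runs the given number of cycles and reports whether the
// reservation holder at holderIndex ever gets scheduled.
func simulate(holderIndex, queueLen, budget, cycles int) bool {
	queue := make([]*ask, queueLen)
	for i := range queue {
		queue[i] = &ask{name: fmt.Sprintf("ask-%d", i), holdsNode: i == holderIndex}
	}
	for c := 0; c < cycles; c++ {
		runCycle(queue, budget)
	}
	return queue[holderIndex].scheduled
}

func main() {
	fmt.Println(simulate(5, 10, 3, 100)) // holder sits behind the budget: false
	fmt.Println(simulate(1, 10, 3, 100)) // holder sits within the budget: true
}
```

In this model the outcome depends only on where the reservation holder sits 
relative to the budget, which matches the observation that the same pod is 
starved on every cycle.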
h2. Thoughts

This is one example, but there are a number of ways we can get stuck in such a 
cycle. For the preemption failure shown here, it seems we need some way to 
either remove the dead reservation or ignore it while considering preemption 
victims.

For example, when we iterate through the nodes in [this 
code|https://github.com/apache/yunikorn-core/blob/master/pkg/scheduler/objects/preemption.go#L163],
 perhaps we could first filter out reserved nodes (as the code does today) and 
then, if that finds nothing, run a second loop that ignores and/or breaks 
reservations on nodes where we find victims. Would ignoring the reservation be 
enough, or do we have to delete it for the preemption to then result in 
scheduling?
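
A rough sketch of the two-pass idea, in Go. This is not the YuniKorn 
implementation and does not use its types: {{node}}, {{pickNode}} and the 
{{victims}} count are hypothetical simplifications, and the sketch leaves open 
whether the reservation is ignored or deleted once a reserved node is chosen.

```go
package main

import "fmt"

// node is a hypothetical, simplified view of a cluster node.
type node struct {
	name       string
	reservedBy string // allocation key holding a reservation, "" if none
	victims    int    // number of preemptable lower priority pods
}

// pickNode returns the first node with at least `needed` victims.
// Pass 1 keeps the current behaviour and skips reserved nodes.
// Pass 2 is only reached if pass 1 found nothing; it considers reserved nodes too.
// The second return value reports whether the chosen node carries a
// reservation that would have to be ignored or broken.
func pickNode(nodes []node, needed int) (string, bool) {
	for _, ignoreReservations := range []bool{false, true} {
		for _, n := range nodes {
			if n.reservedBy != "" && !ignoreReservations {
				continue
			}
			if n.victims >= needed {
				return n.name, n.reservedBy != ""
			}
		}
	}
	return "", false
}

func main() {
	nodes := []node{
		{name: "node-a", victims: 0},
		{name: "node-b", reservedBy: "ask-395", victims: 4},
	}
	name, overridden := pickNode(nodes, 1)
	fmt.Println(name, overridden) // node-b true
}
```

Unreserved nodes still win whenever they have enough victims, so the second 
pass only changes behaviour in the stuck case described above.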

We'd love to get feedback on the following:
 * Is passing a test like the one described above even a goal of YuniKorn 
preemption? 
 * If so, how can we be more strategic about releasing reservations that become 
major blockers, especially in the preemption context?
 * Is there a simple way to opt out of the reservation feature altogether? We 
never want a reservation to block a node. If the pod can't schedule in the 
current cycle, we'd like it to wait without a reservation (in our case a full 
node will always free up at some point, all at once). Or is there something 
we're misunderstanding that makes us need reservations?



> Reservations can permanently block nodes, leading to preemption failure and a 
> stuck scheduler state
> ---------------------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-3092
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3092
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>    Affects Versions: 1.6.3
>            Reporter: John Daciuk
>            Priority: Minor
>              Labels: preemption
>             Fix For: 1.6.3
>
>         Attachments: Screenshot 2025-06-23 at 8.28.55 PM.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h


