John Daciuk created YUNIKORN-3092:
-------------------------------------
Summary: Reservations can permanently block nodes, leading to
preemption failure and a stuck scheduler state
Key: YUNIKORN-3092
URL: https://issues.apache.org/jira/browse/YUNIKORN-3092
Project: Apache YuniKorn
Issue Type: Bug
Components: core - scheduler
Affects Versions: 1.6.3
Reporter: John Daciuk
Fix For: 1.6.3
Attachments: Screenshot 2025-06-23 at 8.28.55 PM.png
h2. Context
Since deploying YuniKorn in October 2024, we've encountered occasional
preemption misses: high priority pods stay pending for hours, and when we
manually delete a low priority pod, the high priority pod schedules.
We dug into this earlier this year and found the upgrade from 1.5.2 to 1.6.2
helpful. In particular, [this
PR|https://github.com/apache/yunikorn-core/pull/1001] achieved 100% of the
expected preemption in our testing due to its reservation removal logic.
However, we still find that 1.6.2 is not reliable with respect to preemption.
h2. Repro
With YuniKorn 1.6.3, schedule ~400 low priority pods that live forever and fill
up all node capacity. Once they are running, schedule the same number of high
priority pods to a different queue. Use the same resources for all the pods.
We expect that all the high priority pods will eventually schedule. However, we
find about 10% of them stuck pending. This can be seen in the attached
screenshot, where the high priority pods are tier0.
If we add logging as in [this diff from
branch-1.6|https://github.com/apache/yunikorn-core/compare/branch-1.6...jdaciuk:yunikorn-core:jdaciuk-1.6],
we see:
{quote}{{2025-06-23T04:54:26.776Z INFO core.scheduler.preemption
objects/preemption.go:93 Removing node ip-100-76-60-239.ec2.internal from
consideration. node.IsReserved: true, node reservations:
map[847f05e1-f74c-403a-8033-154cd76d89c0:ip-100-76-60-239.ec2.internal ->
tier0-1-395-157140|847f05e1-f74c-403a-8033-154cd76d89c0], node fits ask: true
\{"applicationID": "tier0-1-406-328120", "allocationKey":
"e589c683-faf1-4793-97b8-c5f3b3bc34b5", "author": "MLP"}}}
{quote}
A node (with tons of potential victims) ip-100-76-60-239.ec2.internal is
removed from consideration for preemption because it's reserved. Looking at the
reservation map above, we see that pod tier0-1-395 has the reservation.
The pod tier0-1-395 is blocking the entire node. Why can't it schedule and
release the reservation?
{quote}{{2025-06-23T04:43:45.942Z INFO core.scheduler.application
objects/application.go:1008 tryAllocate did not find a candidate allocation
in the node iterator, allowPreemption: true, preemptAttemptsRemaining: 0
\{"applicationID": "tier0-1-395-157140", "author": "MLP"}}}
{quote}
Because there are no more preemption attempts allowed for that queue in this
cycle. Unfortunately, the situation repeats every cycle, since pod tier0-1-395
is never among the first in the queue to tryAllocate.
h2. Thoughts
This is one example, but there are a number of ways we can get stuck in such a
cycle. For the preemption failure here in particular, it seems like we need
some way to either remove the dead reservation or ignore it while considering
preemption victims.
So for example, when we iterate through the nodes in [this
code|https://github.com/apache/yunikorn-core/blob/master/pkg/scheduler/objects/preemption.go#L163],
perhaps we could first filter out reserved nodes (as the code does today), then
run a second loop that ignores and/or breaks reservations if we find victims.
Would ignoring the reservation be enough, or do we have to delete it for the
preemption to then result in scheduling?
We'd love to get feedback on the following:
* Is passing a test like the one described above even a goal of YuniKorn preemption?
* If so, how can we be more strategic about releasing reservations that become
major blockers, esp. in the preemption context?
* We don't suppose there's a simple way to opt out of the reservation feature
altogether, is there? We never want a reservation to block a node. If the
pod can't schedule in the current cycle, we'd like it to wait without a
reservation (in our case a full node will always free up all at once at some
point). Or is there something we're misunderstanding that makes us need
reservations?