[
https://issues.apache.org/jira/browse/YUNIKORN-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17986019#comment-17986019
]
Weiwei Yang commented on YUNIKORN-3092:
---------------------------------------
{quote}
In other words, after DISABLE_RESERVATION="true", we still hit the log line
"reservation added successfully". This looks to be 1:1 with preemptions, so I
guess a pod that preempts still gets a reservation? I'm curious why that is.
Are we trying to guarantee that the preemptor can schedule to the node it
cleared?
{quote}
That is correct. To make sure the preemptor gets the resource, it reserves a
spot on a node, which is why you still see that log line. That reservation is
triggered
[here|https://github.com/apache/yunikorn-core/blob/1a07ad75fba01986c3cd56ce97b9eef3d1876ae0/pkg/scheduler/objects/preemption.go#L658],
in "tryPreempt". What we have disabled is the reservation during the
"tryAllocate" phase: when the scheduler can't find a slot for the pod in a
normal scheduling cycle, it does not reserve a node for the pod and just keeps
trying.
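
To illustrate the split between the two paths, here is a minimal toy model in Go. It is only a sketch: the node and scheduler types, the disableReservation field (standing in for DISABLE_RESERVATION="true"), and both method bodies are hypothetical simplifications, not the yunikorn-core implementation.

```go
package main

import "fmt"

// node is a hypothetical, heavily simplified stand-in for a scheduler node.
type node struct {
	name       string
	free       int    // free resource units
	reservedBy string // pod holding the reservation, empty if none
}

// scheduler is hypothetical; disableReservation stands in for
// DISABLE_RESERVATION="true".
type scheduler struct {
	disableReservation bool
}

// tryAllocate models the normal scheduling cycle. When the ask does not fit
// and reservations are disabled, the pod stays pending without a reservation.
func (s *scheduler) tryAllocate(pod string, ask int, n *node) string {
	if n.reservedBy != "" && n.reservedBy != pod {
		return "skipped" // node is reserved for another pod
	}
	if n.free >= ask {
		n.free -= ask
		n.reservedBy = "" // any reservation held by this pod is released
		return "allocated"
	}
	if s.disableReservation {
		return "pending"
	}
	n.reservedBy = pod
	return "reserved"
}

// tryPreempt models the preemption path. It reserves the node for the
// preemptor regardless of disableReservation, so that the resources freed
// by the victims go to the preemptor.
func (s *scheduler) tryPreempt(pod string, ask, victims int, n *node) string {
	if n.reservedBy != "" && n.reservedBy != pod {
		return "skipped"
	}
	if n.free+victims < ask {
		return "pending" // not enough victims on this node
	}
	n.reservedBy = pod
	return "reserved"
}

func main() {
	s := &scheduler{disableReservation: true}
	n := &node{name: "node-1", free: 0}

	result := s.tryAllocate("high-prio-pod", 2, n)
	fmt.Printf("tryAllocate on %s: %s, reservation holder: %q\n", n.name, result, n.reservedBy)

	result = s.tryPreempt("high-prio-pod", 2, 4, n)
	fmt.Printf("tryPreempt on %s: %s, reservation holder: %q\n", n.name, result, n.reservedBy)
}
```

In this model a full node leaves the pod pending with no reservation after tryAllocate, while tryPreempt still records a reservation, which matches the "reservation added successfully" line appearing once per preemption.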
> Reservations can permanently block nodes, leading to preemption failure and a
> stuck scheduler state
> ---------------------------------------------------------------------------------------------------
>
> Key: YUNIKORN-3092
> URL: https://issues.apache.org/jira/browse/YUNIKORN-3092
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Affects Versions: 1.6.3
> Reporter: John Daciuk
> Priority: Minor
> Labels: preemption
> Fix For: 1.6.3
>
> Attachments: Screenshot 2025-06-23 at 8.28.55 PM.png
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> h2. Context
> Since deploying Yunikorn back in October 2024 we've encountered occasional
> preemption misses. We find a high priority pod pending for hours, manually
> delete a low priority pod, then see the high priority pod schedule.
> We dug into this earlier this year and found the upgrade from 1.5.2 to 1.6.2
> helpful. In particular [this
> PR|https://github.com/apache/yunikorn-core/pull/1001] achieved 100% expected
> preemption in our testing due to its reservation removal logic. However, we
> still find that 1.6.2 is not reliable with respect to preemption in practice.
> And we can repro preemption misses by scaling up our original preemption load
> test by 4x.
> h2. Repro
> With Yunikorn 1.6.3 schedule ~400 low priority pods that live forever and
> fill up all node capacity. Once they are running, schedule the same number of
> high priority pods to a different queue. Use the same resources for all the
> pods.
> We expect that all the high priority pods will eventually schedule. However
> we find about 10% of them stuck pending. This can be seen in the screenshot
> attached, where the high priority pods are tier0.
> If we add logging like in [this diff from
> branch-1.6|https://github.com/apache/yunikorn-core/compare/branch-1.6...jdaciuk:yunikorn-core:jdaciuk-1.6]
> we see
> {quote}{{2025-06-23T04:54:26.776Z INFO core.scheduler.preemption
> objects/preemption.go:93 Removing node ip-100-76-60-239.ec2.internal from
> consideration. node.IsReserved: true, node reservations:
> map[847f05e1-f74c-403a-8033-154cd76d89c0:ip-100-76-60-239.ec2.internal ->
> tier0-1-395-157140|847f05e1-f74c-403a-8033-154cd76d89c0], node fits ask: true
> {"applicationID": "tier0-1-406-328120", "allocationKey":
> "e589c683-faf1-4793-97b8-c5f3b3bc34b5", "author": "MLP"}}}
> {quote}
> A node (with tons of potential victims) ip-100-76-60-239.ec2.internal is
> removed from consideration for preemption because it's reserved. Looking at
> the reservation map above, we see that pod tier0-1-395 has the reservation.
> The pod tier0-1-395 is blocking the entire node. Why can't it schedule and
> release the reservation?
> {quote}{{2025-06-23T04:43:45.942Z INFO core.scheduler.application
> objects/application.go:1008 tryAllocate did not find a candidate
> allocation in the node iterator, allowPreemption: true,
> preemptAttemptsRemaining: 0 {"applicationID": "tier0-1-395-157140",
> "author": "MLP"}}}
> {quote}
> Because there are no more preemption attempts allowed for that queue in
> this cycle. Unfortunately, this situation repeats every cycle, since pod
> tier0-1-395 is never among the first in the queue to tryAllocate.
> h2. Thoughts
> This is one example, but there are a number of ways we can get stuck in such
> a cycle. Particular to the preemption failure here, it seems like we need
> some way to either remove the dead reservation or ignore it while considering
> preemption victims.
> So for example, when we iterate through the nodes in [this
> code|https://github.com/apache/yunikorn-core/blob/master/pkg/scheduler/objects/preemption.go#L163],
> perhaps we could first try with filtering out reserved nodes (as the code
> is) then try another loop ignoring and/or breaking reservations if we find
> victims. Would ignoring the reservation be enough, or do we have to delete it
> for the preemption to then result in scheduling?
> We'd love to get feedback as to the following
> * Is passing a test like the one described above even a goal of Yunikorn preemption?
> * If so, how can we be more strategic about releasing reservations that
> become major blockers, esp. in the preemption context?
> * We don't suppose there's a simple way to opt out of the reservation
> feature altogether, is there? We don't ever want a reservation to block a
> node. If the pod can't schedule in the current cycle, we'd like it to wait
> without a reservation (in our case a full node will always free up at some
> point all at once). Or is there something we're misunderstanding that makes
> us need reservations?
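
The two-pass idea raised under "Thoughts" in the quoted description (first consider only unreserved nodes, then fall back to reserved nodes that have enough victims) could be sketched roughly as follows. This is an illustration only: candidateNode, victimCapacity and pickPreemptionNode are hypothetical names, not part of yunikorn-core.

```go
package main

import "fmt"

// candidateNode is a hypothetical summary of one node as seen by preemption.
type candidateNode struct {
	name           string
	reserved       bool // true if some pod holds a reservation on the node
	victimCapacity int  // resource units that preempting victims would free
}

// pickPreemptionNode makes two passes over the nodes. Pass 1 skips reserved
// nodes, as the current filter does. Pass 2 runs only if pass 1 found
// nothing and also considers reserved nodes. overrodeReservation reports
// whether the chosen node was reserved, so that the caller can decide
// whether to ignore or remove that reservation.
func pickPreemptionNode(nodes []candidateNode, ask int) (name string, overrodeReservation bool, found bool) {
	for _, allowReserved := range []bool{false, true} {
		for _, n := range nodes {
			if n.reserved && !allowReserved {
				continue
			}
			if n.victimCapacity >= ask {
				return n.name, n.reserved, true
			}
		}
	}
	return "", false, false
}

func main() {
	nodes := []candidateNode{
		{name: "node-a", reserved: true, victimCapacity: 8},
		{name: "node-b", reserved: false, victimCapacity: 2},
	}
	name, overrode, found := pickPreemptionNode(nodes, 4)
	fmt.Println(name, overrode, found)
}
```

The sketch covers only node selection. Whether ignoring the reservation is enough, or whether it has to be removed before the preemptor can actually schedule, is the open question from the description and is not answered here.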