> On May 28, 2021, at 4:12 PM, Dario Faggioli <dfaggi...@suse.com> wrote: > > A !runnable unit (temporarily) present in the runq may cause us to > stop scanning the runq itself too early. Of course, we don't run any > non-runnable vCPUs, but we end the scan and we fallback to picking > the idle unit. In other word, this prevent us to find there and pick > the actual unit that we're meant to start running (which might be > further ahead in the runq). > > Depending on the vCPU pinning configuration, this may lead to such > unit to be stuck in the runq for long time, causing malfunctioning > inside the guest. > > Fix this by checking runnable/non-runnable status up-front, in the runq > scanning function. > > Reported-by: Michał Leszczyński <michal.leszczyn...@cert.pl> > Reported-by: Dion Kant <g.w.k...@hunenet.nl> > Signed-off-by: Dario Faggioli <dfaggi...@suse.com>
Thanks for tracking this down, Dario! Reviewed-by: George Dunlap <george.dun...@citrix.com> Just one comment: > @@ -3496,8 +3500,7 @@ runq_candidate(struct csched2_runqueue_data *rqd, > * some budget, then choose it. > */ > if ( (yield || svc->credit > snext->credit) && > - (!has_cap(svc) || unit_grab_budget(svc)) && > - unit_runnable_state(svc->unit) ) > + (!has_cap(svc) || unit_grab_budget(svc)) ) > snext = svc; By the same logic, shouldn’t we also move the `(!has_cap() …)` clause into a separate `if(x) continue` clause? There may be runnable units further down the queue which aren’t capped / haven’t exhausted their budget yet. -George