On 3/20/24 07:04, Tobias Huschle wrote:
> On Tue, Mar 19, 2024 at 02:41:14PM +0100, Vincent Guittot wrote:
>> On Tue, 19 Mar 2024 at 10:08, Tobias Huschle <husc...@linux.ibm.com> wrote:
>>>
>>> On 2024-03-18 15:45, Luis Machado wrote:
>>>> On 3/14/24 13:45, Tobias Huschle wrote:
>>>>> On Fri, Mar 08, 2024 at 03:11:38PM +0000, Luis Machado wrote:
>>>>>> On 2/28/24 16:10, Tobias Huschle wrote:
>>>>>>>
>>>>>>> Questions:
>>>>>>> 1. The kworker getting its negative lag occurs in the following
>>>>>>>    scenario
>>>>>>>    - kworker and a cgroup are supposed to execute on the same CPU
>>>>>>>    - one task within the cgroup is executing and wakes up the
>>>>>>>      kworker
>>>>>>>    - kworker with 0 lag, gets picked immediately and finishes its
>>>>>>>      execution within ~5000ns
>>>>>>>    - on dequeue, kworker gets assigned a negative lag
>>>>>>>    Is this expected behavior? With this short execution time, I
>>>>>>>    would expect the kworker to be fine.
>>>>>>
>>>>>> That strikes me as a bit odd as well. Have you been able to determine
>>>>>> how a negative lag is assigned to the kworker after such a short
>>>>>> runtime?
>>>>>>
>>>>>
>>>>> I did some more trace reading though and found something.
>>>>>
>>>>> What I observed if everything runs regularly:
>>>>> - vhost and kworker run alternating on the same CPU
>>>>> - if the kworker is done, it leaves the runqueue
>>>>> - vhost wakes up the kworker if it needs it
>>>>> --> this means:
>>>>>   - vhost starts alone on an otherwise empty runqueue
>>>>>   - it seems like it never gets dequeued
>>>>>     (unless another unrelated task joins or migration hits)
>>>>>   - if vhost wakes up the kworker, the kworker gets selected
>>>>>   - vhost runtime > kworker runtime
>>>>>     --> kworker gets positive lag and gets selected immediately next
>>>>>         time
>>>>>
>>>>> What happens if it does go wrong:
>>>>> From what I gather, there seem to be occasions where the vhost either
>>>>> executes surprisingly quick, or the kworker surprisingly slow. If
>>>>> these outliers reach critical values, it can happen, that
>>>>>    vhost runtime < kworker runtime
>>>>> which now causes the kworker to get the negative lag.
>>>>>
>>>>> In this case it seems like, that the vhost is very fast in waking up
>>>>> the kworker. And coincidentally, the kworker takes, more time than
>>>>> usual to finish. We speak of 4-digit to low 5-digit nanoseconds.
>>>>>
>>>>> So, for these outliers, the scheduler extrapolates that the kworker
>>>>> out-consumes the vhost and should be slowed down, although in the
>>>>> majority of other cases this does not happen.
>>>>
>>>> Thanks for providing the above details Tobias. It does seem like EEVDF
>>>> is strict about the eligibility checks and making tasks wait when
>>>> their lags are negative, even if just a little bit as in the case of
>>>> the kworker.
>>>>
>>>> There was a patch to disable the eligibility checks
>>>> (https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefes...@chromium.org/),
>>>> which would make EEVDF more like EVDF, though the deadline comparison
>>>> would probably still favor the vhost task instead of the kworker with
>>>> the negative lag.
>>>>
>>>> I'm not sure if you tried it, but I thought I'd mention it.
>>>
>>> Haven't seen that one yet. Unfortunately, it does not help to ignore the
>>> eligibility.
>>>
>>> I'm inclined to rather propose a documentation change, which
>>> describes that tasks should not rely on woken up tasks being scheduled
>>> immediately.
>>
>> Where do you see such an assumption ? Even before eevdf, there were
>> nothing that ensures such behavior. When using CFS (legacy or eevdf)
>> tasks, you can't know if the newly wakeup task will run 1st or not
>>
>
> There was no guarantee of course. place_entity was reducing the vruntime of
> woken up tasks though, giving it a slight boost, right? For the scenario
> that I observed, that boost was enough to make sure, that the woken up tasks
> gets scheduled consistently. This might still not be true for all scenarios,
> but in general EEVDF seems to be stricter with woken up tasks.

It seems that way. EEVDF does both an eligibility check and a deadline
check before picking a task, so a woken task has to pass both of them
before it gets to run.

I think we have some special treatment for when a task initially joins
the competition, in which case we halve its slice. But I don't think
there is any special treatment for woken tasks anymore.

There was also a fix (63304558ba5dcaaff9e052ee43cfdcc7f9c29e85) to try
to reduce the number of wakeup preemptions under some conditions, under
the RUN_TO_PARITY feature.
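
To make the two checks a bit more concrete, here is a toy model in
Python. It is only a sketch of the idea and not the kernel code: the
names Entity, is_eligible and pick_next are made up for illustration,
the vruntime and deadline numbers are invented rather than taken from
any trace, and it ignores RUN_TO_PARITY, slices and cgroup hierarchy.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    # Hypothetical stand-in for a scheduling entity; not a kernel struct.
    name: str
    vruntime: int       # virtual runtime in ns (invented values)
    deadline: int       # virtual deadline in ns (invented values)
    weight: int = 1024  # load weight, equal for all entities here


def is_eligible(entity, queue):
    # lag >= 0 means vruntime <= load-weighted average vruntime of the
    # queue. Written without a division so it stays in integer math:
    #   sum(w_i * (v_i - v)) >= 0  <=>  avg_vruntime - v >= 0
    return sum((e.vruntime - entity.vruntime) * e.weight for e in queue) >= 0


def pick_next(queue):
    # Check 1: drop every entity whose lag is negative.
    eligible = [e for e in queue if is_eligible(e, queue)]
    # Check 2: among the remaining ones, the earliest virtual deadline wins.
    return min(eligible, key=lambda e: e.deadline)


# Case A: kworker consumed slightly more than vhost, so its lag is negative.
case_a = [
    Entity("vhost", vruntime=1_000_000, deadline=3_000_000),
    Entity("kworker", vruntime=1_008_000, deadline=1_500_000),
]
print("case A picks:", pick_next(case_a).name)

# Case B: both have zero lag, so only the deadline decides.
case_b = [
    Entity("vhost", vruntime=1_000_000, deadline=3_000_000),
    Entity("kworker", vruntime=1_000_000, deadline=1_500_000),
]
print("case B picks:", pick_next(case_b).name)
```

In case A the kworker is filtered out by the eligibility check even
though its (invented) deadline is earlier, which is the kind of
strictness discussed above. In case B both entities are eligible and
the earlier deadline decides.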