On 10/31/18 5:40 PM, Juri Lelli wrote:
> On 31/10/18 17:18, Daniel Bristot de Oliveira wrote:
>> On 10/30/18 12:08 PM, luca abeni wrote:
>>> Hi Peter,
>>>
>>> On Tue, 30 Oct 2018 11:45:54 +0100
>>> Peter Zijlstra <pet...@infradead.org> wrote:
>>> [...]
>>>>> 2. This is related to the perf_event_open syscall the reproducer does before becoming DEADLINE and entering the busy loop. Enabling perf swevents generates a lot of hrtimer load that happens in the reproducer task context. Now, DEADLINE uses rq_clock() for setting deadlines, but rq_clock_task() for doing runtime enforcement. In a situation like this the amount of irq pressure becomes pretty big (I'm seeing this on kvm; real hw should maybe do better, but the pain point remains, I guess), so rq_clock() and rq_clock_task() might become more and more skewed w.r.t. each other. Since rq_clock() is only used when setting absolute deadlines for the first time (or when resetting them in certain cases), after a bit the replenishment code will start to see postponed deadlines always in the past w.r.t. rq_clock(). And this brings us back to the fact that the task is never stopped, since it can't keep up with rq_clock().
>>>>>
>>>>> - Not sure yet how we want to address this [1]. We could use rq_clock() everywhere, but tasks might be penalized by irq pressure (theoretically this would mandate that irqs are explicitly accounted for, I guess). I tried to use the skew between the two clocks to "fix" deadlines, but that puts us at risk of de-synchronizing the userspace and kernel views of deadlines.
>>>>
>>>> Hurm.. right. We knew of this issue back when we did it. I suppose now it hurts and we need to figure something out.
>>>>
>>>> By virtue of being a real-time class, we do indeed need to have the deadline on the wall clock. But if we then don't account runtime on that same clock, but on a potentially slower clock, we get the problem that we can run longer than our period/deadline, which is what we're running into here, I suppose.
>>>
>>> I might be hugely misunderstanding something here, but my impression is that the issue is simply that if the IRQ time is not accounted to the -deadline task, then the non-deadline tasks might be starved.
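To make the effect Juri describes concrete, here is a toy userspace simulation (invented numbers, plain C, not kernel code) of a 60% CBS reservation whose runtime is accounted on a task clock that excludes irq time, while its deadlines are postponed on the wall clock:

#include <stdio.h>

int main(void)
{
        const double runtime = 6000.0, period = 10000.0; /* us: 60% reservation */
        const double irq_load = 0.40;   /* fraction of wall time eaten by irqs */

        double wall = 0.0, deadline = period;
        double others = 0.0;            /* wall time left for everybody else */
        int i;

        for (i = 0; i < 8; i++) {
                /*
                 * Consuming 6000us of task-clock runtime costs 10000us of
                 * wall clock, because the task clock (rq_clock_task()) does
                 * not advance while irqs run.
                 */
                wall += runtime / (1.0 - irq_load);

                /*
                 * Throttling would park the task until its deadline -- but
                 * only if that deadline is still in the future on the wall
                 * clock (rq_clock()).
                 */
                if (deadline > wall) {
                        others += deadline - wall;
                        wall = deadline;
                }

                /* replenish: postpone the deadline by one period */
                deadline += period;
                printf("period %d: wall=%6.0f next deadline=%6.0f headroom=%6.0f\n",
                       i, wall, deadline, deadline - wall);
        }
        printf("wall time available to non-deadline tasks: %.0f\n", others);
        return 0;
}

With these numbers the headroom stays pinned at exactly one period and non-deadline tasks get zero wall time; any irq load above 40% makes the postponed deadlines drift permanently into the past, which is the "task is never stopped" behavior described above.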
>>> I do not see this as a skew between two clocks, but as an accounting thing:
>>> - if we decide that the IRQ time is accounted to the -deadline task (this is what happens with CONFIG_IRQ_TIME_ACCOUNTING disabled), then the non-deadline tasks are not starved (but of course the -deadline task executes for less than its reserved time in the period);
>>> - if we decide that the IRQ time is not accounted to the -deadline task (this is what happens with CONFIG_IRQ_TIME_ACCOUNTING enabled), then the -deadline task executes for the expected amount of time (about 60% of the CPU time), but an IRQ load of 40% will starve non-deadline tasks (this is what happens in the bug that triggered this discussion).
>>>
>>> I think this might be seen as an admission control issue: when CONFIG_IRQ_TIME_ACCOUNTING is disabled, the IRQ time is accounted for in the admission control (because it ends up in the task's runtime), but when CONFIG_IRQ_TIME_ACCOUNTING is enabled the IRQ time is not accounted for in the admission test (the IRQ handler becomes some sort of entity with a higher priority than -deadline tasks, on which no accounting or enforcement is performed).
>>
>> I am sorry for taking too long to join the discussion.
>>
>> I agree with Luca. I have seen this behavior twice before: first when we were trying to make the rt throttling have a very short runtime for non-rt threads, and then in the proof of concept of the semi-partitioned scheduler.
>>
>> At first, I also thought of this as a skew between the two clocks and disabled IRQ_TIME_ACCOUNTING. But by ignoring IRQ accounting, we are assuming that the IRQ runtime will be accounted as the thread's runtime. In other words, we are just sweeping the trash under the rug, where the rug is the worst case execution time estimation/definition (which is an even more complex problem). In the Brazilian part of the Ph.D. we are dealing with probabilistic worst case execution time, and to be able to use probabilistic methods, we need to remove the noise of the IRQs from the execution time [1]. So, IMHO, using CONFIG_IRQ_TIME_ACCOUNTING is a good thing.
>>
>> The fact that we have barely any control over the execution of IRQs makes the idea of considering an IRQ as a task seem absurd at first glance. But it is not. An IRQ runs a piece of code that is, in the vast majority of cases, not related to the current thread, so it runs another "task". When more than one IRQ occurs concurrently, the processor serves the IRQs in a predictable order [2], so the processor schedules the IRQs like "tasks". Finally, there are precedence constraints among threads and IRQs. For instance, the latency can be seen as the response time of the timer IRQ handler, plus the delta between the return of the handler and the start of the execution of cyclictest [3]. In theory, the idea of precedence constraints is also about "tasks".
>>
>> So IMHO, IRQs can be considered tasks (I am considering them in my model), and the place to account for this would be in the admission test.
>>
>> The problem is that, to the best of my knowledge, there is no admission test for such a task model/system: two levels of schedulers, with a high priority scheduler that schedules a non-preemptive task set (IRQs) under fixed priority (the processor does this scheduling, and on Intel it is a fixed priority), and a lower priority task set (threads) scheduled by the OS.
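As a purely hypothetical sketch of what such a test could look like (it exists nowhere, and u_irq is assumed here to be a known, bounded quantity, which is exactly what we do not have), one could treat the aggregate IRQ load as one top-priority entity and admit -deadline tasks against the bandwidth it leaves over:

#include <stdio.h>

/* hypothetical sketch: not an existing kernel admission test */
struct dl_params { double runtime, period; };

static int admit_with_irq(const struct dl_params *t, int n, double u_irq)
{
        double u = u_irq;       /* irqs consume bandwidth before any thread */
        int i;

        for (i = 0; i < n; i++)
                u += t[i].runtime / t[i].period;

        return u <= 1.0;        /* uniprocessor utilization bound */
}

int main(void)
{
        struct dl_params task = { .runtime = 6, .period = 10 }; /* 60% */

        printf("u_irq=0.00: %s\n", admit_with_irq(&task, 1, 0.00) ? "admit" : "reject");
        printf("u_irq=0.45: %s\n", admit_with_irq(&task, 1, 0.45) ? "admit" : "reject");
        return 0;
}

With Luca's numbers, the 60% reservation plus the 40% IRQ load sits exactly at the uniprocessor bound, so such a test would admit nothing else on that CPU.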
>> But assuming that our current admission control is more of a safeguard than an exact admission test - that is, for multiprocessor it is necessary, but not sufficient. (Theoretically, it works for uniprocessor, but... there is a paper by Rob Davis somewhere showing that if "context switches" (and so the scheduler, in our case) have different costs, then many things no longer hold; for instance, Deadline Monotonic is not optimal... but I will have to read more before getting into this point; anyway, for multiprocessor it is only necessary.)
>>
>> With this in mind: we do *not* use/have an exact admission test for all cases. By not having an exact admission test, we assume the user knows what he/she is doing. In this case, if they have a high load of IRQs... they need to know that:
>>
>> 1) Their periods should be consistent with the "interference" they might receive.
>>
>> 2) Their tasks can miss deadlines because of IRQs (and there is no way to avoid this without "throttling" IRQs...).
>>
>> So, is it worth putting duct tape on this case?
>>
>> My fear is that, by putting duct tape here, we would make things prone to more complex errors/nondeterminism... so...
>>
>> I think we have another point to add to the discussion at Plumbers, Juri.
>
> Yeah, sure. My fear in a case like this, though, is that the task that ends up starving others is "creating" IRQ overhead on itself. Kind of a DoS, no?
I see your point. But what about a non-rt thread that creates a lot of timers, and then a DL task arrives, preempts it, and receives the interference from interrupts that were caused by the previous thread?

Actually, enabling/disabling sched stats in a loop generates an IPI storm on all (other) CPUs because of updates to jump labels (we will reduce/bound that with the batching of jump label updates, but still, the problem will exist). And not only that: IIRC, we can cause this with a madvise that causes the flushing of pages.

> I'm seeing something along the lines of what Peter suggested as a last resort measure we probably still need to put in place.

To be clear, I am not against the/a fix; I just think that... it is more complicated than it seems.

For example: let's assume that we have a bad non-rt thread A on CPU 0 generating IPIs because of static key updates, and a good dl thread B on CPU 1. In this case, thread B could run for less than what was reserved for it, even though it was not the one causing the interrupts. It is not fair to put a penalty on thread B. The same is valid for a dl thread running on the same CPU that is receiving a lot of network packets destined for another application, and other legitimate cases.

In the end, if we want to avoid starving non-rt threads, we need to prioritize them some of the time; but in this case, we are back to the DL server for non-rt threads.

Thoughts?

Thanks,

-- Daniel
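P.S.: to put rough numbers on that last trade-off, here is a toy calculation (all utilizations invented; this is only bandwidth arithmetic, not a design) of what a hypothetical "DL server" reservation for non-rt tasks would change:

#include <stdio.h>

int main(void)
{
        const double u_dl = 0.60;       /* admitted -deadline utilization */
        const double u_irq = 0.40;      /* irq load, invisible to admission */
        const double u_srv = 0.10;      /* hypothetical non-rt reservation */

        /* today: irqs + dl can consume the whole CPU */
        printf("non-rt share today          : %.2f\n", 1.0 - u_dl - u_irq);

        /* with a server, non-rt progress is guaranteed by construction... */
        printf("non-rt share with dl server : %.2f\n", u_srv);

        /* ...but the remaining bandwidth no longer covers dl + irq demand,
         * so under the same irq load the dl task now misses deadlines */
        printf("left for dl + irq           : %.2f (demand: %.2f)\n",
               1.0 - u_srv, u_dl + u_irq);
        return 0;
}

Which is just another way of saying that somebody has to pay for the IRQ time; the server only chooses who.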