Hi Daniel,

2017-01-04 16:14 GMT+01:00, Daniel Bristot de Oliveira <bris...@redhat.com>:
> On 01/04/2017 01:17 PM, luca abeni wrote:
>> Hi Daniel,
>>
>> On Tue, 3 Jan 2017 19:58:38 +0100
>> Daniel Bristot de Oliveira <bris...@redhat.com> wrote:
>>
>> [...]
>>> In a four core box, if I dispatch 11 tasks [1] with setup:
>>>
>>>     period  = 30 ms
>>>     runtime = 10 ms
>>>     flags   = 0 (GRUB disabled)
>>>
>>> I see this:
>>> ------------------------------- HTOP ------------------------------------
>>>   1  [|||||||||||||||||||||92.5%]   Tasks: 128, 259 thr; 14 running
>>>   2  [|||||||||||||||||||||91.0%]   Load average: 4.65 4.66 4.81
>>>   3  [|||||||||||||||||||||92.5%]   Uptime: 05:12:43
>>>   4  [|||||||||||||||||||||92.5%]   Mem[|||||||||||||||1.13G/3.78G]
>>>                                     Swp[      0K/3.90G]
>>>
>>>   PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
>>> 16247 root    -101   0  4204   632   564 R 32.4  0.0  2:10.35 d
>>> 16249 root    -101   0  4204   624   556 R 32.4  0.0  2:09.80 d
>>> 16250 root    -101   0  4204   728   660 R 32.4  0.0  2:09.58 d
>>> 16252 root    -101   0  4204   676   608 R 32.4  0.0  2:09.08 d
>>> 16253 root    -101   0  4204   636   568 R 32.4  0.0  2:08.85 d
>>> 16254 root    -101   0  4204   732   664 R 32.4  0.0  2:08.62 d
>>> 16255 root    -101   0  4204   620   556 R 32.4  0.0  2:08.40 d
>>> 16257 root    -101   0  4204   708   640 R 32.4  0.0  2:07.98 d
>>> 16256 root    -101   0  4204   624   560 R 32.4  0.0  2:08.18 d
>>> 16248 root    -101   0  4204   680   612 R 33.0  0.0  2:10.15 d
>>> 16251 root    -101   0  4204   676   608 R 33.0  0.0  2:09.34 d
>>> 16259 root      20   0  124M  4692  3120 R  1.1  0.1  0:02.82 htop
>>>  2191 bristot   20   0  649M 41312 32048 S  0.0  1.0  0:28.77 gnome-ter
>>> ------------------------------- HTOP ------------------------------------
>>>
>>> All tasks are using +- the same amount of CPU time, a little bit more
>>> than 30%, as expected.
>>
>> Notice that, if I understand well, each task should receive 33.33% (1/3)
>> of CPU time. Anyway, I think this is ok...
>
> If we think of a partitioned system, yes, for the CPUs on which 3 'd'
> tasks are able to run. But as sched deadline is global by definition,
> the load is:
>
>     SUM(U_i) / M processors
>
>     1/3 * 11 / 4 = 0.916666667
>
> So 10/30 (1/3) of this workload is:
>
>     91.6 / 3 = 30.533333333
>
> Well, the rest is probably overheads, like scheduling, migration...
I do not think this math is correct... Yes, the total utilization of the
task set is 0.91 (or 3.66, depending on how you define the utilization),
but I still think that the percentage of CPU time shown by "top" or
"htop" should be 33.33 (or 8.33, depending on how the tool computes it).
runtime=10 and period=30 mean "schedule the task for 10ms every 30ms", so
the task will consume 33% of the CPU time of a single core. In other
words, 10/30 is a fraction of the CPU time, not a fraction of the time
consumed by SCHED_DEADLINE tasks (a minimal sched_setattr() sketch for
this kind of reservation is appended at the end of this message).

>>> However, if I enable GRUB in the same task set I get this:
>>>
>>> ------------------------------- HTOP ------------------------------------
>>>   1  [|||||||||||||||||||||93.8%]   Tasks: 128, 260 thr; 15 running
>>>   2  [|||||||||||||||||||||95.2%]   Load average: 5.13 5.01 4.98
>>>   3  [|||||||||||||||||||||93.3%]   Uptime: 05:01:02
>>>   4  [|||||||||||||||||||||96.4%]   Mem[|||||||||||||||1.13G/3.78G]
>>>                                     Swp[      0K/3.90G]
>>>
>>>   PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%    TIME+  Command
>>> 14967 root    -101   0  4204   628   564 R 45.8  0.0  1h07:49 g
>>> 14962 root    -101   0  4204   728   660 R 45.8  0.0  1h05:06 g
>>> 14959 root    -101   0  4204   680   612 R 45.2  0.0  1h07:29 g
>>> 14927 root    -101   0  4204   624   556 R 44.6  0.0  1h04:30 g
>>> 14928 root    -101   0  4204   656   588 R 31.1  0.0 47:37.21 g
>>> 14961 root    -101   0  4204   684   616 R 31.1  0.0 47:19.75 g
>>> 14968 root    -101   0  4204   636   568 R 31.1  0.0 46:27.36 g
>>> 14960 root    -101   0  4204   684   616 R 23.8  0.0 37:31.06 g
>>> 14969 root    -101   0  4204   684   616 R 23.8  0.0 38:11.50 g
>>> 14925 root    -101   0  4204   636   568 R 23.8  0.0 37:34.88 g
>>> 14926 root    -101   0  4204   684   616 R 23.8  0.0 38:27.37 g
>>> 16182 root      20   0  124M  3972  3212 R  0.6  0.1  0:00.23 htop
>>>   862 root      20   0  264M  5668  4832 S  0.6  0.1  0:03.30 iio-sensor
>>>  2191 bristot   20   0  649M 41312 32048 S  0.0  1.0  0:27.62 gnome-term
>>>   588 root      20   0  257M  121M  120M S  0.0  3.1  0:13.53 systemd-jo
>>> ------------------------------- HTOP ------------------------------------
>>>
>>> Some tasks start to use more CPU time, while others seem to use less
>>> CPU than was reserved for them. See task 14926: it is using only
>>> 23.8% of the CPU, which is less than its 10/30 reservation.
>>
>> What happened here is that some runqueues have an active utilisation
>> larger than 0.95. So, GRUB is decreasing the amount of time received by
>> the tasks on those runqueues so that they consume less than 95%... This
>> is the reason for the effect you noticed below:
>
> I see. But, AFAIK, Linux's sched deadline measures the load globally,
> not locally. So, it is not a problem to have a load higher than 95% in
> the local queue if the global load is lower than 95%.
>
> Am I missing something?

The version of GRUB reclaiming implemented in my patches tracks a
per-runqueue "active utilization", and uses it for reclaiming.
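
To make the accounting rule concrete, here is a toy user-space sketch
(this is not the patch code; BW_SHIFT, the fixed-point scale and the
example numbers are only for illustration): while a task executes for
"delta" nanoseconds, GRUB charges roughly delta * Uact against its
runtime instead of delta, where Uact is the active utilization of the
task's runqueue. With Uact > 1 (the 1.33 case mentioned below) the
charged time is larger than the real execution time, which is consistent
with the "consumed runtime greater than the actual consumed runtime"
effect described right below. The ~95% cap applied by the patches is not
modelled here.

/* Toy model of the GRUB accounting rule: dq = -Uact * dt. */
#include <stdio.h>
#include <stdint.h>

#define BW_SHIFT	20
#define BW_UNIT		(1ULL << BW_SHIFT)	/* fixed-point 1.0 */

/* Runtime charged for "delta_ns" of actual execution, GRUB-style. */
static uint64_t grub_charge(uint64_t delta_ns, uint64_t uact)
{
	return (delta_ns * uact) >> BW_SHIFT;
}

int main(void)
{
	uint64_t u_task = BW_UNIT / 3;	/* one runtime=10ms / period=30ms task */
	uint64_t delta = 1000000;	/* the task really ran for 1ms */

	/* 3 such tasks on one runqueue: Uact ~= 1.00 -> charged ~1ms */
	printf("Uact=1.00: charged %llu ns\n",
	       (unsigned long long)grub_charge(delta, 3 * u_task));
	/* 4 such tasks on one runqueue: Uact ~= 1.33 -> charged ~1.33ms */
	printf("Uact=1.33: charged %llu ns\n",
	       (unsigned long long)grub_charge(delta, 4 * u_task));
	return 0;
}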

>>> After some debugging, it seems that in this case GRUB is also
>>> _reducing_ the runtime of the task by making the notion of consumed
>>> runtime be greater than the actual consumed runtime.
>> [...]
>>
>> Now, this is "kind of expected", because you have 11 tasks, each one
>> having utilisation 1/3, distributed on 4 CPUs... So, some CPU will have
>> 3 tasks on it, resulting in an utilisation = 1 > 0.95. But this should
>> not result in what you have seen in htop...
>
> Well, sched deadline aims to schedule the M highest-priority tasks,
> and migrates tasks to achieve this goal. However, I am not sure if
> keeping the runqueues balanced is a goal/restriction/feature of the
> deadline scheduler.
>
> Maybe this is the difference between the GRUB and sched deadline
> assumptions that is causing the problem. Just thinking aloud.

I think I found some strange behaviour in the push/pull mechanisms (at
least, it looks strange to me):
- a "pull" operation might end up pulling multiple tasks. I see that this
  can simplify the implementation, but I think pulling more than one task
  is useless and might introduce some overhead, even independently from
  my patches;
- I suspect (but I still need to verify this) that a "push" operation can
  push a task to a "wrong" destination runqueue, that is, to a runqueue
  where it is not the earliest-deadline task.
Without reclaiming, this just results in useless migrations (if I did not
misunderstand something), but with my reclaiming patches it is probably
the source of the strange effect you saw. I am still investigating, so I
am not too sure...

>> The real issue seems to be that at some point some runqueues have an
>> active utilisation = 1.33 (4 dl tasks in the runqueue), with other
>> runqueues only having 2 tasks... And this results in the huge imbalance
>> in utilisations you noticed. I am trying to understand why this
>> happens... It seems to me that a "pull_dl_task()" might end up pulling
>> more than 1 task... Is this possible?
>
> Yeah, this explains the numbers.
>
> Brainstorm time! (sorry if it sounds obviously unfeasible):
> Is it possible to think of GRUB tracking the global utilization?

Yes, and I even had a version of my patches using a "per root domain"
global active utilization. If needed, I can update my patchset to
implement the global active utilization again.

I switched to per-runqueue active utilization because:
- it can be used for controlling CPU frequency scaling... and I have been
  told that frequency scaling is generally per-core / per-CPU (but I need
  to verify this);
- the patches based on the global active utilization needed to access it
  in mutual exclusion, so I used a spinlock to protect it... and I am not
  sure about the scalability of that;
- I suspect there were issues when the root domain / exclusive cpuset is
  modified.
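
Here is the small sched_setattr() sketch mentioned above: a minimal way
to create one of the runtime=10ms / period=30ms tasks. It follows the
sched_setattr(2) man page (the struct layout and the SCHED_DEADLINE value
come from there; double-check them against your headers), and assumes the
syscall number is exposed through <sys/syscall.h>. With sched_flags = 0
it corresponds to your "GRUB disabled" run; setting the reclaiming flag
introduced by the patches enables the reclaiming behaviour. Running 11
copies of it on a four core box should give a setup like the one in your
htop output.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* ns */
	uint64_t sched_deadline;	/* ns */
	uint64_t sched_period;		/* ns */
};

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE	6
#endif

int main(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_DEADLINE,
		.sched_flags	= 0,	/* 0: no reclaiming; set the GRUB flag here to reclaim */
		.sched_runtime	= 10 * 1000 * 1000,	/* 10 ms */
		.sched_deadline	= 30 * 1000 * 1000,	/* 30 ms */
		.sched_period	= 30 * 1000 * 1000,	/* 30 ms */
	};

	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}

	for (;;)	/* burn CPU; the reservation limits us to ~33% of one core */
		;
}

Thanks,
			Luca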