On Fri, Jul 05, 2019 at 12:25:46PM +0100, Alan Jenkins wrote:
> Hi, scheduler experts!
>
> My cpu "iowait" time appears to be reported incorrectly. Do you know why
> this could happen?

Because iowait is a magic random number that has no sane meaning.
Personally I'd prefer to just delete the whole thing, except ABI :/

Also see the comment near nr_iowait():

/*
 * IO-wait accounting, and how its mostly bollocks (on SMP).
 *
 * The idea behind IO-wait account is to account the idle time that we could
 * have spend running if it were not for IO. That is, if we were to improve the
 * storage performance, we'd have a proportional reduction in IO-wait time.
 *
 * This all works nicely on UP, where, when a task blocks on IO, we account
 * idle time as IO-wait, because if the storage were faster, it could've been
 * running and we'd not be idle.
 *
 * This has been extended to SMP, by doing the same for each CPU. This however
 * is broken.
 *
 * Imagine for instance the case where two tasks block on one CPU, only the one
 * CPU will have IO-wait accounted, while the other has regular idle. Even
 * though, if the storage were faster, both could've ran at the same time,
 * utilising both CPUs.
 *
 * This means, that when looking globally, the current IO-wait accounting on
 * SMP is a lower bound, by reason of under accounting.
 *
 * Worse, since the numbers are provided per CPU, they are sometimes
 * interpreted per CPU, and that is nonsensical. A blocked task isn't strictly
 * associated with any one particular CPU, it can wake to another CPU than it
 * blocked on. This means the per CPU IO-wait number is meaningless.
 *
 * Task CPU affinities can make all that even more 'interesting'.
 */
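
To make the "two tasks block on one CPU" case concrete, here is a toy model of the accounting the comment describes. It is only a sketch, not the kernel's code: the function names account_interval() and ideal_iowait() are made up for illustration, every CPU is assumed to be idle for the whole interval, and "ideal" is my reading of "both could've ran at the same time", i.e. each blocked task could have occupied one CPU.

```python
def account_interval(blocked_on, n_cpus, duration):
    """Toy per-CPU accounting for one interval in which every CPU is idle.

    blocked_on: one CPU id per task currently blocked on IO (the CPU the
                task went to sleep on). Hypothetical input format.
    Returns (iowait, idle): two lists of per-CPU times.
    """
    # Count IO-blocked tasks per CPU, loosely mirroring a per-runqueue counter.
    nr_iowait = [0] * n_cpus
    for cpu in blocked_on:
        nr_iowait[cpu] += 1

    # An idle CPU reports the interval as iowait only if some task blocked
    # on *that* CPU; otherwise it reports plain idle.
    iowait = [duration if nr_iowait[cpu] > 0 else 0.0 for cpu in range(n_cpus)]
    idle = [duration - iowait[cpu] for cpu in range(n_cpus)]
    return iowait, idle


def ideal_iowait(blocked_on, n_cpus, duration):
    """CPU time that could have been used with faster storage (assumption:
    each blocked task could run on its own CPU, up to n_cpus)."""
    return min(len(blocked_on), n_cpus) * duration


# The example from the comment: two tasks, both blocked on CPU 0, two CPUs.
iowait, idle = account_interval([0, 0], n_cpus=2, duration=1.0)
total_accounted = sum(iowait)
total_possible = ideal_iowait([0, 0], n_cpus=2, duration=1.0)
```

In this model CPU 0 reports the whole interval as iowait and CPU 1 reports it as idle, so the global iowait sum is half of what faster storage could have recovered. That is the "lower bound, by reason of under accounting" from the comment. Had the two tasks blocked on different CPUs, the same workload would report twice the iowait, which is why the per-CPU split tells you nothing.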