On Fri, 18 Nov 2016 15:47:02 +0100 Frederic Weisbecker <fweis...@gmail.com> wrote:
> On Fri, Nov 18, 2016 at 01:08:46PM +0100, Martin Schwidefsky wrote: > > On Thu, 17 Nov 2016 19:08:07 +0100 > > Frederic Weisbecker <fweis...@gmail.com> wrote: > > > > > I'm sorry for the patchbomb, especially as I usually complain about > > > these myself but I don't see any way to split this patchset into > > > standalone pieces, none of which would make any sense... All I can do > > > is to isolate about 3 cleanup patches. > > > > On first glance the patches look ok-ish, but I am not happy about the > > direction this takes. > > > > I can understand the wish to consolidate the common code to a single > > format which is nano-seconds. It will have repercussions though. > > > > First the obvious problem, it does not compile for s390: > > > > arch/s390/kernel/vtime.c: In function 'do_account_vtime': > > arch/s390/kernel/vtime.c:140:25: error: implicit declaration of function > > 'cputime_to_nsecs' [-Werror=implicit-function-declaration] > > account_user_time(tsk, cputime_to_nsecs(user)); > > ^~~~~~~~~~~~~~~~ > > arch/s390/kernel/idle.c: In function 'enabled_wait': > > arch/s390/kernel/idle.c:46:20: error: implicit declaration of function > > 'cputime_to_nsecs' [-Werror=implicit-function-declaration] > > account_idle_time(cputime_to_nsecs(idle_time)); > > ^~~~~~~~~~~~~~~~ > > arch/s390/kernel/idle.c: In function 'arch_cpu_idle_time': > > arch/s390/kernel/idle.c:100:9: error: implicit declaration of function > > 'cputime_to_nsec' [-Werror=implicit-function-declaration] > > return cputime_to_nsec(idle_enter ? ((idle_exit ?: now) - idle_enter) : > > 0); > > ^~~~~~~~~~~~~~~ > > Yes sorry I haven't yet done much build-testing. I should have written that > it's > not build-tested yet. This patchset in its current state is rather an RFC. No big deal, I got it to compile with a small change. > > The error at idle.c:100 is a typo cputime_to_nsec vs cputime_to_nsecs. > > The other two could probably be solved with an additional include but the > > default cputime_to_nsecs is in include/linux/cputime.h is this: > > > > #ifndef cputime_to_nsecs > > # define cputime_to_nsecs(__ct) \ > > (cputime_to_usecs(__ct) * NSEC_PER_USEC) > > #endif > > > > which downgrades the accuracy for s390 from better than nano-seconds > > to micro-seconds. Not good. For the s390 cputime format you would have > > to do > > > > static inline unsigned long long cputime_to_nsecs(const cputime_t cputime) > > { > > return ((__force unsigned long long) cputime * 1000) >> 12; > > } > > I agree, that loss of acurracy is my biggest worry. Hence the accumulation > idea, but more about that later. We can not allow that to happen, but the accumulation should take care of it. > > > > But this *example* function has an overflow problem. > > > > > So currently, cputime_t serves the purpose, for s390 and > > > powerpc (on CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y), to avoid converting > > > arch clock counters to nanosecs or jiffies while accounting cputime. > > > > The cputime_t has several purposes: > > 1) Allow for different units in the calculations for virtual cpu time. > > There are currently three models: jiffies, nano-seconds and the native > > TOD clock format for s390 which is a bit better than nano-seconds. > > Sure, I don't disagree with that, just with the way it is done (ie: stored > and maintained in the core to this very obscure type). > > > 2) Act as a marker in the common code where a virtual cpu time is used. > > This is more important than you might think, unfortunately it is very > > easy to confuse a wall-clock delta with cpu time. > > There you lost me, I don't get which confusion you're pointing. The confusion stems from the fact that you do *not* have a simple nano-second value but a modal value that depends on the architecture. More below.. > > 3) Avoid expensive operations on the fast path to convert the native cpu > > time to something else. Instead move the expensive calculation to the > > read-out code, e.g. fs/proc. > > > > You patches breaks all three of these purposes. My main gripe is with 3). > > > > > But this comes at the cost of a lot of complexity and uglification > > > in the core code to deal with such an opaque type that relies on lots of > > > mutators and accessors in order to deal with a random granularity time > > > unit that also involve lots of workarounds and likely some performance > > > penalties. > > > > Having an opaque type with a set of helper functions is the whole point, no? > > And I would not call the generic implementations for jiffies or nano-seconds > > complex, these are easy enough to understand. And what are the performance > > penalties you are talking about? > > Just because some code isn't too complex doesn't mean we really want to keep > it. > I get regular questions about what unit does cputime_t map to on a given > configuration. Everybody gets confused about that. On many of the > patches we got on cputime for the last years, I had to fix quite some issues > with bad granularity assumption. In fact most fixes that came to > kernel/sched/cputime.c > recently, after merge or review, were about people getting confused with > cputime_t granularity. These regular question you get about the cputime_t is exactly what I was referring to. If the value would just be a u64 the guys asking the question about cputime_t would just assume the value to be nano-seconds and then go ahead and break things. > Especially for stats that come from nsecs clocks (steal and irqtime), we > always have to maintain an > accumulator and make sure we don't lose some nanosec deltas. Yes, for the CONFIG_IRQ_TIME_ACCOUNTING=y case. > And we have to maintain several workarounds, sometimes even in the fastpath in > order to cope with the cputime_t random granularity all over. > > Some fastpath examples: > > * steal time accounting (need to convert nsecs to cputime then back) > * irqtime accounting (maintain accumulators) > * cputime_adjust, used on any user read of cputime (need to convert from nsecs > to cputime on cputime_adjust) > > But the worst really is about maintainance. This patchset removes around 600 > lines. Well 300 lines is from the powerpc and s390 cputime.h header and ~200 from the generic cputime_jiffies.h and cputime_nsecs.h. > > > > > So this patchset proposes to convert most of the cputime_t uses to nsecs. > > > In the end it's only used by s390 and powerpc. This all comes at the > > > expense of those two archs which then need to perform a cputime_to_nsec() > > > conversion everytime they update the cputime to the core. Now I expect > > > we can leverage this performance loss with flushing the cputime only on > > > ticks so that we accumulate time as cputime_t in between and make the > > > conversions more rare. > > > > It is not just one cputime_to_nsec that we would have to add but several. > > Three in do_account_vtime and one in vtime_account_irq_enter. > > > > The do_account_vtime function is called once per jiffy and once per task > > switch. HZ is usually set to 100 for s390, the conversion once per jiffy > > would not be so bad, but the call on the scheduling path *will* hurt. > > I don't think we need to flush on task switch. If we maintain the accumulators > on the task/thread struct instead of per-cpu, then the remaining time after > task switch out will be accounted on next tick after after next task switch > in. You can not properly calculate steal time if you allow sleeping tasks to sit on up to 5*HZ worth of cpu time. I think we *have* to do accounting on task switch. At least on s390, likely on powerpc as well. Why not make that an option for the architecture with the yet-to-be-written accumulating code. > > What is even worse is the vtime_account_irq_enter path, that is call several > > times for each *interrupt*, at least two times for an interrupt without > > additional processing and four times if a softirq is triggered. > > Actually maintaining an accumulator to flush on ticks is probably going to > increase > the perf because of that. account_system_time() is called twice per > interrupt, and > such function do much more than just account the time to the task_struct and > cpustat > fields. The same applies to userspace boundaries and context switch. The > account_*_time() > functions can be expensive. The account_system_time twice per interrupt can be removed with the accumulation idea. We will have to see how expensive the accounting_xxx_time calls are on the context switch path. > > > > Now it has been proposed to implement lazy accounting to accumulate deltas > > and do the expensive conversions only infrequently. This is pretty straight- > > forward for account_user_time but to do this for the account_system_time > > function is more complicated. The function has to differentiate between > > guest/hardirq/softirq and pure system time. We would need to keep sums for > > each bucket and provide a separate function to add to each bucket. Like > > account_guest_time(), account_hardirq_time(), account_softirq_time() and > > account_system_time(). Then it is up to the arch code to sort out the > > details > > and call the accounting code once per jiffy for each of the buckets. > > That wouldn't be too hard really. The s390 code in vtime.c already does that. Yes, I agree that the accumulating change would not be too hard. Can I make the request that we try to get that done first before doing the cleanup ? > > We still have to do the whole thing on each task switch though. > > Not if we maintain the deltas in the task_struct. > > > > > But I am still not happy about the approach. What is the compelling reason > > for this change except for the "but it looks ugly"? > > The diffstat (600 lines removed). Also the fact that we have all these > workarounds > in the core code just for the special case of 1 arch (s390) and a half > (powerpc with CONFIG_VIRT_CPU_ACCOUNTING_NATIVE). > > I'd much rather have all that complexity moved in a vtime_native.c shared by > s390 and powerpc > that takes care of proper accumulation in cputime_t and flushes that on ticks > in nsecs rather > than having all these cputime_t game all over the kernel. The goal to have nano-seconds only in the core code is a good one. And with the accumulator I think s390 can live with it. The change would have a real upside too. There are these stupid divisions for scaled cputime that we have to calculate for every call to account_xxx_time(). These would not be done for the interrupts anymore. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.