On Mon, Nov 21, 2016 at 07:59:56AM +0100, Martin Schwidefsky wrote: > On Fri, 18 Nov 2016 15:47:02 +0100 > Frederic Weisbecker <fweis...@gmail.com> wrote: > > Just because some code isn't too complex doesn't mean we really want to > > keep it. > > I get regular questions about what unit does cputime_t map to on a given > > configuration. Everybody gets confused about that. On many of the > > patches we got on cputime for the last years, I had to fix quite some issues > > with bad granularity assumption. In fact most fixes that came to > > kernel/sched/cputime.c > > recently, after merge or review, were about people getting confused with > > cputime_t granularity. > > These regular question you get about the cputime_t is exactly what I was > referring > to. If the value would just be a u64 the guys asking the question about > cputime_t > would just assume the value to be nano-seconds and then go ahead and break > things.
Sure, replacing cputime_t with u64 without changing the unit wouldn't help. But changing it to nsecs and expect people to deduce it from the u64 type sounds a good direction. > > > Especially for stats that come from nsecs clocks (steal and irqtime), we > > always have to maintain an > > accumulator and make sure we don't lose some nanosec deltas. > > Yes, for the CONFIG_IRQ_TIME_ACCOUNTING=y case. Right. > > > And we have to maintain several workarounds, sometimes even in the fastpath > > in > > order to cope with the cputime_t random granularity all over. > > > > Some fastpath examples: > > > > * steal time accounting (need to convert nsecs to cputime then back) > > * irqtime accounting (maintain accumulators) > > * cputime_adjust, used on any user read of cputime (need to convert from > > nsecs > > to cputime on cputime_adjust) > > > > But the worst really is about maintainance. This patchset removes around > > 600 lines. > > Well 300 lines is from the powerpc and s390 cputime.h header and ~200 from > the generic cputime_jiffies.h and cputime_nsecs.h. Well, still worth it :-) > > > The do_account_vtime function is called once per jiffy and once per task > > > switch. HZ is usually set to 100 for s390, the conversion once per jiffy > > > would not be so bad, but the call on the scheduling path *will* hurt. > > > > I don't think we need to flush on task switch. If we maintain the > > accumulators > > on the task/thread struct instead of per-cpu, then the remaining time after > > task switch out will be accounted on next tick after after next task switch > > in. > > You can not properly calculate steal time if you allow sleeping tasks to sit > on > up to 5*HZ worth of cpu time. Ah, you mean that when the task goes to sleep, we shouldn't miss more than one tick worth of system/user time but the steal time can be much higher, right? > I think we *have* to do accounting on task switch. > At least on s390, likely on powerpc as well. Why not make that an option for > the architecture with the yet-to-be-written accumulating code. Ok, how about doing the accumulation and always account on task switch for now, we'll see later if it's worth having such an option. > > > > What is even worse is the vtime_account_irq_enter path, that is call > > > several > > > times for each *interrupt*, at least two times for an interrupt without > > > additional processing and four times if a softirq is triggered. > > > > Actually maintaining an accumulator to flush on ticks is probably going to > > increase > > the perf because of that. account_system_time() is called twice per > > interrupt, and > > such function do much more than just account the time to the task_struct > > and cpustat > > fields. The same applies to userspace boundaries and context switch. The > > account_*_time() > > functions can be expensive. > > The account_system_time twice per interrupt can be removed with the > accumulation > idea. We will have to see how expensive the accounting_xxx_time calls are on > the context switch path. Right. > > > > > > > Now it has been proposed to implement lazy accounting to accumulate deltas > > > and do the expensive conversions only infrequently. This is pretty > > > straight- > > > forward for account_user_time but to do this for the account_system_time > > > function is more complicated. The function has to differentiate between > > > guest/hardirq/softirq and pure system time. We would need to keep sums for > > > each bucket and provide a separate function to add to each bucket. Like > > > account_guest_time(), account_hardirq_time(), account_softirq_time() and > > > account_system_time(). Then it is up to the arch code to sort out the > > > details > > > and call the accounting code once per jiffy for each of the buckets. > > > > That wouldn't be too hard really. The s390 code in vtime.c already does > > that. > > Yes, I agree that the accumulating change would not be too hard. Can I make > the > request that we try to get that done first before doing the cleanup ? Of course. I see you started something, I'll be glad to help! > > > > We still have to do the whole thing on each task switch though. > > > > Not if we maintain the deltas in the task_struct. > > > > > > > > But I am still not happy about the approach. What is the compelling reason > > > for this change except for the "but it looks ugly"? > > > > The diffstat (600 lines removed). Also the fact that we have all these > > workarounds > > in the core code just for the special case of 1 arch (s390) and a half > > (powerpc with CONFIG_VIRT_CPU_ACCOUNTING_NATIVE). > > > > I'd much rather have all that complexity moved in a vtime_native.c shared > > by s390 and powerpc > > that takes care of proper accumulation in cputime_t and flushes that on > > ticks in nsecs rather > > than having all these cputime_t game all over the kernel. > > The goal to have nano-seconds only in the core code is a good one. And with > the > accumulator I think s390 can live with it. The change would have a real upside > too. There are these stupid divisions for scaled cputime that we have to > calculate > for every call to account_xxx_time(). These would not be done for the > interrupts > anymore. Exactly! Thanks.