On Fri, 18 Nov 2016 15:47:02 +0100
Frederic Weisbecker <fweis...@gmail.com> wrote:

> On Fri, Nov 18, 2016 at 01:08:46PM +0100, Martin Schwidefsky wrote:
> > On Thu, 17 Nov 2016 19:08:07 +0100
> > Frederic Weisbecker <fweis...@gmail.com> wrote:
> > 
> > > I'm sorry for the patchbomb, especially as I usually complain about
> > > these myself but I don't see any way to split this patchset into
> > > standalone pieces, none of which would make any sense... All I can do
> > > is to isolate about 3 cleanup patches.
> > 
> > On first glance the patches look ok-ish, but I am not happy about the
> > direction this takes.
> > 
> > I can understand the wish to consolidate the common code to a single
> > format which is nano-seconds. It will have repercussions though.
> > 
> > First the obvious problem, it does not compile for s390:
> > 
> > arch/s390/kernel/vtime.c: In function 'do_account_vtime':
> > arch/s390/kernel/vtime.c:140:25: error: implicit declaration of function
> >     'cputime_to_nsecs' [-Werror=implicit-function-declaration]
> >   account_user_time(tsk, cputime_to_nsecs(user));
> >                          ^~~~~~~~~~~~~~~~
> > arch/s390/kernel/idle.c: In function 'enabled_wait':
> > arch/s390/kernel/idle.c:46:20: error: implicit declaration of function
> >     'cputime_to_nsecs' [-Werror=implicit-function-declaration]
> >   account_idle_time(cputime_to_nsecs(idle_time));
> >                     ^~~~~~~~~~~~~~~~
> > arch/s390/kernel/idle.c: In function 'arch_cpu_idle_time':
> > arch/s390/kernel/idle.c:100:9: error: implicit declaration of function
> >     'cputime_to_nsec' [-Werror=implicit-function-declaration]
> >   return cputime_to_nsec(idle_enter ? ((idle_exit ?: now) - idle_enter) : 
> > 0);
> >          ^~~~~~~~~~~~~~~
> 
> Yes sorry I haven't yet done much build-testing. I should have written that 
> it's
> not build-tested yet. This patchset in its current state is rather an RFC.

No big deal, I got it to compile with a small change.

> > The error at idle.c:100 is a typo cputime_to_nsec vs cputime_to_nsecs.
> > The other two could probably be solved with an additional include but the
> > default cputime_to_nsecs is in include/linux/cputime.h is this:
> > 
> > #ifndef cputime_to_nsecs
> > # define cputime_to_nsecs(__ct) \
> >         (cputime_to_usecs(__ct) * NSEC_PER_USEC)
> > #endif
> > 
> > which downgrades the accuracy for s390 from better than nano-seconds
> > to micro-seconds. Not good. For the s390 cputime format you would have
> > to do
> > 
> > static inline unsigned long long cputime_to_nsecs(const cputime_t cputime)
> > {
> >         return ((__force unsigned long long) cputime * 1000) >> 12;
> > }
> 
> I agree, that loss of acurracy is my biggest worry. Hence the accumulation
> idea, but more about that later.

We can not allow that to happen, but the accumulation should take care of it.

> > 
> > But this *example* function has an overflow problem. 
> > 
> > > So currently, cputime_t serves the purpose, for s390 and
> > > powerpc (on CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y), to avoid converting
> > > arch clock counters to nanosecs or jiffies while accounting cputime.
> > 
> > The cputime_t has several purposes:
> > 1) Allow for different units in the calculations for virtual cpu time.
> >    There are currently three models: jiffies, nano-seconds and the native
> >    TOD clock format for s390 which is a bit better than nano-seconds.
> 
> Sure, I don't disagree with that, just with the way it is done (ie: stored
> and maintained in the core to this very obscure type).
> 
> > 2) Act as a marker in the common code where a virtual cpu time is used.
> >    This is more important than you might think, unfortunately it is very
> >    easy to confuse a wall-clock delta with cpu time.
> 
> There you lost me, I don't get which confusion you're pointing.

The confusion stems from the fact that you do *not* have a simple nano-second
value but a modal value that depends on the architecture. More below..

> > 3) Avoid expensive operations on the fast path to convert the native cpu
> >    time to something else. Instead move the expensive calculation to the
> >    read-out code, e.g. fs/proc.
> > 
> > You patches breaks all three of these purposes. My main gripe is with 3).
> > 
> > > But this comes at the cost of a lot of complexity and uglification
> > > in the core code to deal with such an opaque type that relies on lots of
> > > mutators and accessors in order to deal with a random granularity time
> > > unit that also involve lots of workarounds and likely some performance
> > > penalties.
> > 
> > Having an opaque type with a set of helper functions is the whole point, no?
> > And I would not call the generic implementations for jiffies or nano-seconds
> > complex, these are easy enough to understand. And what are the performance
> > penalties you are talking about?
> 
> Just because some code isn't too complex doesn't mean we really want to keep 
> it.
> I get regular questions about what unit does cputime_t map to on a given
> configuration. Everybody gets confused about that. On many of the
> patches we got on cputime for the last years, I had to fix quite some issues
> with bad granularity assumption. In fact most fixes that came to 
> kernel/sched/cputime.c
> recently, after merge or review, were about people getting confused with 
> cputime_t granularity.

These regular question you get about the cputime_t is exactly what I was 
referring
to. If the value would just be a u64 the guys asking the question about 
cputime_t
would just assume the value to be nano-seconds and then go ahead and break 
things.

> Especially for stats that come from nsecs clocks (steal and irqtime), we 
> always have to maintain an
> accumulator and make sure we don't lose some nanosec deltas.

Yes, for the CONFIG_IRQ_TIME_ACCOUNTING=y case.

> And we have to maintain several workarounds, sometimes even in the fastpath in
> order to cope with the cputime_t random granularity all over.
> 
> Some fastpath examples:
> 
> * steal time accounting (need to convert nsecs to cputime then back)
> * irqtime accounting (maintain accumulators)
> * cputime_adjust, used on any user read of cputime (need to convert from nsecs
>   to cputime on cputime_adjust)
> 
> But the worst really is about maintainance. This patchset removes around 600 
> lines.

Well 300 lines is from the powerpc and s390 cputime.h header and ~200 from
the generic cputime_jiffies.h and cputime_nsecs.h.

> > 
> > > So this patchset proposes to convert most of the cputime_t uses to nsecs.
> > > In the end it's only used by s390 and powerpc. This all comes at the
> > > expense of those two archs which then need to perform a cputime_to_nsec()
> > > conversion everytime they update the cputime to the core. Now I expect
> > > we can leverage this performance loss with flushing the cputime only on
> > > ticks so that we accumulate time as cputime_t in between and make the
> > > conversions more rare.
> > 
> > It is not just one cputime_to_nsec that we would have to add but several.
> > Three in do_account_vtime and one in vtime_account_irq_enter.
> > 
> > The do_account_vtime function is called once per jiffy and once per task
> > switch. HZ is usually set to 100 for s390, the conversion once per jiffy
> > would not be so bad, but the call on the scheduling path *will* hurt.
> 
> I don't think we need to flush on task switch. If we maintain the accumulators
> on the task/thread struct instead of per-cpu, then the remaining time after
> task switch out will be accounted on next tick after after next task switch 
> in.

You can not properly calculate steal time if you allow sleeping tasks to sit on
up to 5*HZ worth of cpu time. I think we *have* to do accounting on task switch.
At least on s390, likely on powerpc as well. Why not make that an option for
the architecture with the yet-to-be-written accumulating code.

> > What is even worse is the vtime_account_irq_enter path, that is call several
> > times for each *interrupt*, at least two times for an interrupt without
> > additional processing and four times if a softirq is triggered.
> 
> Actually maintaining an accumulator to flush on ticks is probably going to 
> increase
> the perf because of that. account_system_time() is called twice per 
> interrupt, and
> such function do much more than just account the time to the task_struct and 
> cpustat
> fields. The same applies to userspace boundaries and context switch. The 
> account_*_time()
> functions can be expensive.

The account_system_time twice per interrupt can be removed with the accumulation
idea. We will have to see how expensive the accounting_xxx_time calls are on
the context switch path.

> > 
> > Now it has been proposed to implement lazy accounting to accumulate deltas
> > and do the expensive conversions only infrequently. This is pretty straight-
> > forward for account_user_time but to do this for the account_system_time
> > function is more complicated. The function has to differentiate between
> > guest/hardirq/softirq and pure system time. We would need to keep sums for
> > each bucket and provide a separate function to add to each bucket. Like
> > account_guest_time(), account_hardirq_time(), account_softirq_time() and
> > account_system_time(). Then it is up to the arch code to sort out the 
> > details
> > and call the accounting code once per jiffy for each of the buckets.
> 
> That wouldn't be too hard really. The s390 code in vtime.c already does that.

Yes, I agree that the accumulating change would not be too hard. Can I make the
request that we try to get that done first before doing the cleanup ?
 
> > We still have to do the whole thing on each task switch though.
> 
> Not if we maintain the deltas in the task_struct.
> 
> > 
> > But I am still not happy about the approach. What is the compelling reason
> > for this change except for the "but it looks ugly"?
> 
> The diffstat (600 lines removed). Also the fact that we have all these 
> workarounds
> in the core code just for the special case of 1 arch (s390) and a half
> (powerpc with CONFIG_VIRT_CPU_ACCOUNTING_NATIVE).
> 
> I'd much rather have all that complexity moved in a vtime_native.c shared by 
> s390 and powerpc
> that takes care of proper accumulation in cputime_t and flushes that on ticks 
> in nsecs rather
> than having all these cputime_t game all over the kernel.

The goal to have nano-seconds only in the core code is a good one. And with the
accumulator I think s390 can live with it. The change would have a real upside
too. There are these stupid divisions for scaled cputime that we have to 
calculate
for every call to account_xxx_time(). These would not be done for the interrupts
anymore. 

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

Reply via email to