On Mon, Nov 21, 2016 at 07:59:56AM +0100, Martin Schwidefsky wrote:
> On Fri, 18 Nov 2016 15:47:02 +0100
> Frederic Weisbecker <fweis...@gmail.com> wrote:
> > Just because some code isn't too complex doesn't mean we really want to 
> > keep it.
> > I get regular questions about what unit does cputime_t map to on a given
> > configuration. Everybody gets confused about that. On many of the
> > patches we got on cputime for the last years, I had to fix quite some issues
> > with bad granularity assumption. In fact most fixes that came to 
> > kernel/sched/cputime.c
> > recently, after merge or review, were about people getting confused with 
> > cputime_t granularity.
> 
> These regular question you get about the cputime_t is exactly what I was 
> referring
> to. If the value would just be a u64 the guys asking the question about 
> cputime_t
> would just assume the value to be nano-seconds and then go ahead and break 
> things.

Sure, replacing cputime_t with u64 without changing the unit wouldn't help. But 
changing
it to nsecs and expect people to deduce it from the u64 type sounds a good 
direction.

> 
> > Especially for stats that come from nsecs clocks (steal and irqtime), we 
> > always have to maintain an
> > accumulator and make sure we don't lose some nanosec deltas.
> 
> Yes, for the CONFIG_IRQ_TIME_ACCOUNTING=y case.

Right.

> 
> > And we have to maintain several workarounds, sometimes even in the fastpath 
> > in
> > order to cope with the cputime_t random granularity all over.
> > 
> > Some fastpath examples:
> > 
> > * steal time accounting (need to convert nsecs to cputime then back)
> > * irqtime accounting (maintain accumulators)
> > * cputime_adjust, used on any user read of cputime (need to convert from 
> > nsecs
> >   to cputime on cputime_adjust)
> > 
> > But the worst really is about maintainance. This patchset removes around 
> > 600 lines.
> 
> Well 300 lines is from the powerpc and s390 cputime.h header and ~200 from
> the generic cputime_jiffies.h and cputime_nsecs.h.

Well, still worth it :-)

> > > The do_account_vtime function is called once per jiffy and once per task
> > > switch. HZ is usually set to 100 for s390, the conversion once per jiffy
> > > would not be so bad, but the call on the scheduling path *will* hurt.
> > 
> > I don't think we need to flush on task switch. If we maintain the 
> > accumulators
> > on the task/thread struct instead of per-cpu, then the remaining time after
> > task switch out will be accounted on next tick after after next task switch 
> > in.
> 
> You can not properly calculate steal time if you allow sleeping tasks to sit 
> on
> up to 5*HZ worth of cpu time.

Ah, you mean that when the task goes to sleep, we shouldn't miss more than one
tick worth of system/user time but the steal time can be much higher, right?

> I think we *have* to do accounting on task switch.
> At least on s390, likely on powerpc as well. Why not make that an option for
> the architecture with the yet-to-be-written accumulating code.

Ok, how about doing the accumulation and always account on task switch for now,
we'll see later if it's worth having such an option.

> 
> > > What is even worse is the vtime_account_irq_enter path, that is call 
> > > several
> > > times for each *interrupt*, at least two times for an interrupt without
> > > additional processing and four times if a softirq is triggered.
> > 
> > Actually maintaining an accumulator to flush on ticks is probably going to 
> > increase
> > the perf because of that. account_system_time() is called twice per 
> > interrupt, and
> > such function do much more than just account the time to the task_struct 
> > and cpustat
> > fields. The same applies to userspace boundaries and context switch. The 
> > account_*_time()
> > functions can be expensive.
> 
> The account_system_time twice per interrupt can be removed with the 
> accumulation
> idea. We will have to see how expensive the accounting_xxx_time calls are on
> the context switch path.

Right.

> 
> > > 
> > > Now it has been proposed to implement lazy accounting to accumulate deltas
> > > and do the expensive conversions only infrequently. This is pretty 
> > > straight-
> > > forward for account_user_time but to do this for the account_system_time
> > > function is more complicated. The function has to differentiate between
> > > guest/hardirq/softirq and pure system time. We would need to keep sums for
> > > each bucket and provide a separate function to add to each bucket. Like
> > > account_guest_time(), account_hardirq_time(), account_softirq_time() and
> > > account_system_time(). Then it is up to the arch code to sort out the 
> > > details
> > > and call the accounting code once per jiffy for each of the buckets.
> > 
> > That wouldn't be too hard really. The s390 code in vtime.c already does 
> > that.
> 
> Yes, I agree that the accumulating change would not be too hard. Can I make 
> the
> request that we try to get that done first before doing the cleanup ?

Of course. I see you started something, I'll be glad to help!

>  
> > > We still have to do the whole thing on each task switch though.
> > 
> > Not if we maintain the deltas in the task_struct.
> > 
> > > 
> > > But I am still not happy about the approach. What is the compelling reason
> > > for this change except for the "but it looks ugly"?
> > 
> > The diffstat (600 lines removed). Also the fact that we have all these 
> > workarounds
> > in the core code just for the special case of 1 arch (s390) and a half
> > (powerpc with CONFIG_VIRT_CPU_ACCOUNTING_NATIVE).
> > 
> > I'd much rather have all that complexity moved in a vtime_native.c shared 
> > by s390 and powerpc
> > that takes care of proper accumulation in cputime_t and flushes that on 
> > ticks in nsecs rather
> > than having all these cputime_t game all over the kernel.
> 
> The goal to have nano-seconds only in the core code is a good one. And with 
> the
> accumulator I think s390 can live with it. The change would have a real upside
> too. There are these stupid divisions for scaled cputime that we have to 
> calculate
> for every call to account_xxx_time(). These would not be done for the 
> interrupts
> anymore.

Exactly!


Thanks.

Reply via email to