Hi Cyril,

On Fri, Jun 23, 2017 at 04:03:12PM +1000, Cyril Bur wrote:
> On Thu, 2017-06-22 at 17:27 -0300, Breno Leitao wrote:
> > Currently giveup_all() calls __giveup_fpu(), __giveup_altivec(), and
> > __giveup_vsx(). But __giveup_vsx() also calls __giveup_fpu() and
> > __giveup_altivec() again, in a redundant manner.
> > 
> > Other than giving up FP and Altivec, __giveup_vsx() also disables
> > MSR_VSX on MSR, but this is already done by __giveup_{fpu,altivec}().
> > As VSX can not be enabled alone on MSR (without FP and/or VEC
> > enabled), this is also a redundancy. VSX is never enabled alone (without
> > FP and VEC) because every time VSX is enabled, as through load_up_vsx()
> > and restore_math(), FP is also enabled together.
> > 
> > This change improves giveup_all() in average just 3%, but since
> > giveup_all() is called very frequently, around 8x per CPU per second on
> > an idle machine, this change might show some noticeable improvement.
> > 
> 
> So I totally agree except this makes me quite nervous. I know we're
> quite good at always disabling VSX when we disable FPU and ALTIVEC and
> we do always turn VSX on when we enable FPU AND ALTIVEC. But still, if
> we ever get that wrong...

Right, I understand your point. We can consider this code a 'fallback'
in case we somehow forget to disable VSX when disabling FPU/ALTIVEC.
Good point.

> I'm more interested in how this improves giveup_all() performance by so
> much, but then hardware often surprises - I guess that's the cost of a
> function call.

I got this number using ftrace. I used the 'function_graph' tracer with the
'funcgraph-duration' option set in trace_options, and then set
set_ftrace_filter to giveup_all.
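
In case it helps to reproduce the measurement, the steps above could be
scripted roughly as below. This is only a sketch: tracefs_write() and
trace_giveup_all() are names I made up for illustration, and the tracefs
root is passed in by the caller (it is usually /sys/kernel/tracing or
/sys/kernel/debug/tracing, but that depends on the system).

```c
#include <stdio.h>

/* Hypothetical helper: write one value into a tracefs control file.
 * Returns 0 on success, -1 on any error. */
static int tracefs_write(const char *root, const char *file, const char *value)
{
    char path[512];
    FILE *f;
    int n, ok;

    n = snprintf(path, sizeof(path), "%s/%s", root, file);
    if (n < 0 || n >= (int)sizeof(path))
        return -1;

    f = fopen(path, "w");
    if (!f)
        return -1;

    ok = fputs(value, f) >= 0;
    /* fclose() flushes the buffer, so a rejected write is reported here. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Hypothetical helper: restrict tracing to giveup_all, select the
 * function_graph tracer, then enable per-call durations. */
static int trace_giveup_all(const char *root)
{
    if (tracefs_write(root, "set_ftrace_filter", "giveup_all"))
        return -1;
    if (tracefs_write(root, "current_tracer", "function_graph"))
        return -1;
    return tracefs_write(root, "trace_options", "funcgraph-duration");
}
```

Run as root with the real tracefs root, the 'trace' file in the same
directory should then show the duration of each giveup_all() call.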

There is also a tool that helps with this, if you wish. It uses exactly
the same mechanism I used, but in a more automated way. The tool is
funcgraph, by Brendan Gregg:

https://github.com/brendangregg/perf-tools/blob/master/kernel/funcgraph

> Perhaps caching the thread.regs->msr isn't a good idea.

Yes, I looked at it, but it seems the compiler is optimizing it, keeping
it in r30 rather than saving it to memory/stack. This is the code being
generated here, where r9 contains the task pointer.

 usermsr = tsk->thread.regs->msr;
        c0000000000199c4:       08 01 c9 eb     ld r30,264(r9)

 if ((usermsr & msr_all_available) == 0)
        c0000000000199c8:       60 5f 2a e9     ld r9,24416(r10)
        c0000000000199cc:       39 48 ca 7f     and.  r10,r30,r9
        c0000000000199d0:       20 00 82 40     bne c0000000000199f0 <giveup_all+0x60>

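To make the check in that listing concrete, here is a small user-space
model of it. It is a sketch, not the kernel code: needs_giveup() is a
hypothetical name, and the MSR bit positions are the ones I remember from
arch/powerpc/include/asm/reg.h, so treat them as assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, for illustration only. */
#define MSR_FP  (UINT64_C(1) << 13)
#define MSR_VSX (UINT64_C(1) << 23)
#define MSR_VEC (UINT64_C(1) << 25)

/* Models the test at the top of giveup_all(): usermsr is read once and
 * then masked against the facilities available on this CPU. When the
 * result is zero there is nothing to give up. */
static bool needs_giveup(uint64_t usermsr, uint64_t msr_all_available)
{
    return (usermsr & msr_all_available) != 0;
}
```

With all three facilities available, needs_giveup() is false for a task
whose MSR has none of those bits set, and true as soon as any one of
them is set.
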
> If we could
> branch over in the common case but still have the call to the
> function in case something goes horribly wrong?

Yes, we can revisit it at a future opportunity. Thanks for sharing your opinion.
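
For when we do revisit it, the idea could look roughly like the
user-space model below. Everything here is hypothetical: the model_*()
functions only mimic the MSR bit clearing described in the patch (FP and
Altivec giveup also clear MSR_VSX), the bit positions are assumed from
asm/reg.h, and the counter exists only so the slow path can be observed.

```c
#include <stdint.h>

/* Assumed bit positions, for illustration only. */
#define MSR_FP  (UINT64_C(1) << 13)
#define MSR_VSX (UINT64_C(1) << 23)
#define MSR_VEC (UINT64_C(1) << 25)

/* Same hint the kernel's unlikely() gives; GCC/Clang builtin. */
#define unlikely(x) __builtin_expect(!!(x), 0)

static int vsx_fallback_hits; /* how many times the slow path ran */

static uint64_t model_giveup_fpu(uint64_t msr)
{
    return msr & ~(MSR_FP | MSR_VSX);
}

static uint64_t model_giveup_altivec(uint64_t msr)
{
    return msr & ~(MSR_VEC | MSR_VSX);
}

static uint64_t model_giveup_vsx(uint64_t msr)
{
    vsx_fallback_hits++;
    return msr & ~(MSR_FP | MSR_VEC | MSR_VSX);
}

/* Returns the user MSR after giving up every enabled facility. */
static uint64_t model_giveup_all(uint64_t usermsr)
{
    if (usermsr & MSR_FP)
        usermsr = model_giveup_fpu(usermsr);
    if (usermsr & MSR_VEC)
        usermsr = model_giveup_altivec(usermsr);

    /* Common case: VSX was already cleared above, so we branch over.
     * The call stays as a fallback if VSX is somehow set on its own. */
    if (unlikely(usermsr & MSR_VSX))
        usermsr = model_giveup_vsx(usermsr);

    return usermsr;
}
```

In this model the fallback never runs when VSX is enabled together with
FP or VEC, and it still clears the bit if VSX ever shows up alone.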
