On 22.12.2023 23:00, Andrew Cooper wrote:
> When livepatching is enabled, this function is used all the time.  Really do
> check the fastpath first, and annotate it likely() as this is the right answer
> 100% of the time (to many significant figures).
> 
> This cuts out 3 pointer dereferences in the "nothing to do path", and it seems
> the optimiser has an easier time too.  Bloat-o-meter reports:
> 
>   add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-57 (-57)
>   Function                                     old     new   delta
>   check_for_livepatch_work.cold               1201    1183     -18
>   check_for_livepatch_work                    1021     982     -39
> 
> which isn't too shabby for no logical change.
> 
> Signed-off-by: Andrew Cooper <andrew.coop...@citrix.com>

Reviewed-by: Jan Beulich <jbeul...@suse.com>

> I'm still a little disappointed with the code generation.  GCC still chooses
> to set up the full stack frame (6 regs, +3 more slots) intermixed with the
> per-cpu calculations.
> 
> In isolation, GCC can check the boolean without creating a stack frame:
> 
>   <work_to_to>:
>     48 89 e2                mov    %rsp,%rdx
>     48 8d 05 de e1 37 00    lea    0x37e1de(%rip),%rax        # 
> ffff82d0405b6068 <per_cpu__work_to_do>
>     48 81 ca ff 7f 00 00    or     $0x7fff,%rdx
>     8b 4a c1                mov    -0x3f(%rdx),%ecx
>     48 8d 15 45 aa 39 00    lea    0x39aa45(%rip),%rdx        # 
> ffff82d0405d28e0 <__per_cpu_offset>
>     48 8b 14 ca             mov    (%rdx,%rcx,8),%rdx
>     0f b6 04 02             movzbl (%rdx,%rax,1),%eax
>     c3                      retq
> 
> but I can't find a way to convince GCC that it would be worth not setting up a
> stack frame in in the common case, and having a few extra mov reg/reg's later
> in the uncommon case.
> 
> I haven't tried manually splitting the function into a check() and a do()
> function.  Views on whether that might be acceptable?  At a guess, do() would
> need to be a static noinline to avoid it turning back into what it currently
> is.

Or maybe move the fast-path check into an inline function, which calls the
(renamed) out-of-line one only when the fast-path check passes? Downside
would be that the per-CPU work_to_do variable then couldn't be static anymore.

Jan

Reply via email to