On 04/01/2024 1:24 pm, Jan Beulich wrote:
> On 22.12.2023 23:00, Andrew Cooper wrote:
>> When livepatching is enabled, this function is used all the time.  Really do
>> check the fastpath first, and annotate it likely() as this is the right 
>> answer
>> 100% of the time (to many significant figures).
>>
>> This cuts out 3 pointer dereferences in the "nothing to do path", and it 
>> seems
>> the optimiser has an easier time too.  Bloat-o-meter reports:
>>
>>   add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-57 (-57)
>>   Function                                     old     new   delta
>>   check_for_livepatch_work.cold               1201    1183     -18
>>   check_for_livepatch_work                    1021     982     -39
>>
>> which isn't too shabby for no logical change.
>>
>> Signed-off-by: Andrew Cooper <andrew.coop...@citrix.com>
> Reviewed-by: Jan Beulich <jbeul...@suse.com>

Thanks.

>
>> I'm still a little disappointed with the code generation.  GCC still chooses
>> to set up the full stack frame (6 regs, +3 more slots) intermixed with the
>> per-cpu calculations.
>>
>> In isolation, GCC can check the boolean without creating a stack frame:
>>
>>   <work_to_to>:
>>     48 89 e2                mov    %rsp,%rdx
>>     48 8d 05 de e1 37 00    lea    0x37e1de(%rip),%rax        # 
>> ffff82d0405b6068 <per_cpu__work_to_do>
>>     48 81 ca ff 7f 00 00    or     $0x7fff,%rdx
>>     8b 4a c1                mov    -0x3f(%rdx),%ecx
>>     48 8d 15 45 aa 39 00    lea    0x39aa45(%rip),%rdx        # 
>> ffff82d0405d28e0 <__per_cpu_offset>
>>     48 8b 14 ca             mov    (%rdx,%rcx,8),%rdx
>>     0f b6 04 02             movzbl (%rdx,%rax,1),%eax
>>     c3                      retq
>>
>> but I can't find a way to convince GCC that it would be worth not setting up 
>> a
>> stack frame in in the common case, and having a few extra mov reg/reg's later
>> in the uncommon case.
>>
>> I haven't tried manually splitting the function into a check() and a do()
>> function.  Views on whether that might be acceptable?  At a guess, do() would
>> need to be a static noinline to avoid it turning back into what it currently
>> is.
> Or maybe move the fast-path check into an inline function, which calls the
> (renamed) out-of-line one only when the fast-path check passes? Downside
> would be that the per-CPU work_to_do variable then couldn't be static anymore.

We can't do that unfortunately.  It's called from assembly in
reset_stack_and_*()

But, I've had another idea.  I think attribute cold can help here, as it
will move the majority of the implementation into a separate section.

~Andrew

Reply via email to