On 25/11/2024 2:29 pm, Jan Beulich wrote:
> Stop the compiler from inlining non-trivial memset() and memcpy() (for
> memset() see e.g. map_vcpu_info() or kimage_load_segments() for
> examples). This way we even keep the compiler from using REP STOSQ /
> REP MOVSQ when we'd prefer REP STOSB / REP MOVSB (when ERMS is
> available).
>
> With gcc10 this yields a modest .text size reduction (release build) of
> around 2k.
>
> Unfortunately these options aren't understood by the clang versions I
> have readily available for testing with; I'm unaware of equivalents.
>
> Note also that using cc-option-add is not an option here, or at least I
> couldn't make things work with it (in case the option was not supported
> by the compiler): The embedded comma in the option looks to be getting
> in the way.
>
> Requested-by: Andrew Cooper <andrew.coop...@citrix.com>
> Signed-off-by: Jan Beulich <jbeul...@suse.com>
> ---
> v3: Re-base.
> v2: New.
> ---
> The boundary values are of course up for discussion - I wasn't really
> certain whether to use 16 or 32; I'd be less certain about using yet
> larger values.
>
> Similarly whether to permit the compiler to emit REP STOSQ / REP MOVSQ
> for known size, properly aligned blocks is up for discussion.

I didn't realise there were any options like this.
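
For anyone following along without the patch in front of them: the
options in question look to be GCC's -mmemcpy-strategy= /
-mmemset-strategy= pair, which take comma-separated
alg:max-size:dest-align triplets - which would also explain the
cc-option-add trouble with the embedded comma.  A minimal, purely
illustrative C sketch of the kind of call site affected (names and
boundary value made up):

#include <string.h>

struct blob { unsigned long words[32]; };  /* illustrative 256-byte object */

void clear_blob(struct blob *b)
{
    /*
     * By default GCC expands this inline, typically as REP STOSQ for a
     * known-size, suitably-aligned block.  With something along the
     * lines of
     *   -mmemset-strategy=unrolled_loop:16:noalign,libcall:-1:noalign
     * (boundary value hypothetical), anything above 16 bytes becomes an
     * out-of-line memset() call instead, which can then use REP STOSB
     * when ERMS is available.
     */
    memset(b, 0, sizeof(*b));
}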

The result is very different on GCC-12, with the following extremes:

add/remove: 0/0 grow/shrink: 83/71 up/down: 8764/-3913 (4851)
Function                                     old     new   delta
x86_emulate                               136966  139990   +3024
ptwr_emulated_cmpxchg                        555    1058    +503
hvm_emulate_cmpxchg                         1178    1648    +470
hvmemul_do_io                               1605    2059    +454
hvmemul_linear_mmio_access                  1060    1324    +264
hvmemul_write_cache                          655     890    +235
...
do_console_io                               1293    1170    -123
arch_get_info_guest                         2200    2072    -128
avtab_read_item                              821     692    -129
acpi_tb_create_local_fadt                    866     714    -152
xz_dec_lzma2_run                            2573    2272    -301
__hvm_copy                                  1085     737    -348
Total: Before=3902769, After=3907620, chg +0.12%

So there is a mix, but it's in a distinctly upward direction.


As a possibly-related tangent, something I did notice when playing with
-fanalyzer was that even adding attr(alloc_size/alloc_align) helped the
code generation for an inlined memcpy().

e.g. with _xmalloc() only getting
__attribute__((alloc_size(1),alloc_align(2))), functions like
init_domain_cpu_policy() go from:

48 8b 13                 mov    (%rbx),%rdx
48 8d 78 08              lea    0x8(%rax),%rdi
48 89 c1                 mov    %rax,%rcx
48 89 de                 mov    %rbx,%rsi
48 83 e7 f8              and    $0xfffffffffffffff8,%rdi
48 89 10                 mov    %rdx,(%rax)
48 29 f9                 sub    %rdi,%rcx
48 8b 93 b0 07 00 00     mov    0x7b0(%rbx),%rdx
48 29 ce                 sub    %rcx,%rsi
81 c1 b8 07 00 00        add    $0x7b8,%ecx
48 89 90 b0 07 00 00     mov    %rdx,0x7b0(%rax)
c1 e9 03                 shr    $0x3,%ecx
f3 48 a5                 rep movsq %ds:(%rsi),%es:(%rdi)

down to simply

48 89 c7                 mov    %rax,%rdi
b9 f7 00 00 00           mov    $0xf7,%ecx
48 89 ee                 mov    %rbp,%rsi
f3 48 a5                 rep movsq %ds:(%rsi),%es:(%rdi)

which removes the logic to cope with a misaligned destination pointer.
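
For concreteness, this is roughly what I was experimenting with
(prototype reproduced from memory, so treat it as a sketch rather than
the exact Xen declaration):

#include <string.h>

/* _xmalloc() as annotated for the experiment. */
void *_xmalloc(unsigned long size, unsigned long align)
    __attribute__((alloc_size(1), alloc_align(2)));

/* Illustrative stand-in: 0xf7 quadwords, matching the REP MOVSQ count
 * in the disassembly above. */
struct policy { unsigned long words[0xf7]; };

struct policy *dup_policy(const struct policy *src)
{
    struct policy *p = _xmalloc(sizeof(*p), __alignof__(*p));

    if ( p )
        /*
         * alloc_align tells the compiler that p is __alignof__(*p)
         * aligned, and alloc_size that it is at least sizeof(*p) bytes,
         * so the inlined copy no longer needs the destination
         * realignment preamble and collapses to a plain REP MOVSQ.
         */
        memcpy(p, src, sizeof(*p));

    return p;
}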


As a possibly unrelated tangent, even __attribute__((malloc)) on its own
seems to cause some code gen changes.

In xenctl_bitmap_to_cpumask(), the change is simply to no longer align
the -ENOMEM basic block, saving 8 bytes.  This is quite reasonable,
because xmalloc() genuinely failing happens 0% of the time to many
significant figures.

Mostly, though, it's just basic block churn: the attribute seems to give
the return value a "likely not NULL" prediction, thereby shuffling the
error paths around.

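For completeness, the bare-attribute variant is just (same from-memory
caveat):

/* _xmalloc() with only the malloc attribute - no size/alignment info. */
void *_xmalloc(unsigned long size, unsigned long align)
    __attribute__((malloc));

That tells the optimiser the returned pointer doesn't alias anything
pre-existing, and its branch prediction heuristics appear to treat the
NULL (allocation failure) return as unlikely, which matches the error
path shuffling above.
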
~Andrew
