https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104582
--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #8)
> Just trying a dumb microbenchmark:
> struct S { unsigned long a, b; } s;
>
> __attribute__((noipa)) void
> foo (unsigned long a, unsigned long b)
> {
>   s.a = a;
>   s.b = b;
> }
>
> int
> main ()
> {
>   int i;
>   for (i = 0; i < 1000000000; i++)
>     foo (42, 43);
>   return 0;
> }
> the GCC 11 vs. GCC 12 code:
> -	movq	%rdi, s(%rip)
> -	movq	%rsi, s+8(%rip)
> +	movq	%rdi, %xmm0
> +	movq	%rsi, %xmm1
> +	punpcklqdq	%xmm1, %xmm0
> +	movaps	%xmm0, s(%rip)
> seems to be exactly the same speed (on i9-7960X) and the GCC 11 code is 7
> bytes smaller.

The GCC 12 code is 30% slower on Zen 2 (the gpr -> xmm move is comparatively
more costly there).  As said we fail to account for that.  But as I said the
cost is not there if it's

struct S { unsigned long a, b; } s;

__attribute__((noipa)) void
foo (unsigned long *a, unsigned long *b)
{
  unsigned long a_ = *a;
  unsigned long b_ = *b;
  s.a = a_;
  s.b = b_;
}

which vectorizes to

        movq    (%rdi), %xmm0
        movhps  (%rsi), %xmm0
        movaps  %xmm0, s(%rip)
        ret

which is _smaller_ than the scalar code.  So it's important to be able to
distinguish those cases.  The above is also

a__3 1 times scalar_store costs 12 in body
b__5 1 times scalar_store costs 12 in body
a__3 1 times vector_store costs 12 in body
<unknown> 1 times vec_construct costs 8 in prologue