https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824
--- Comment #6 from Chris Elrod <elrodc at gmail dot com> ---
Hongtao Liu, I do think that one should ideally be able to get optimal codegen when using 512-bit builtin vectors or vector intrinsics, without needing to set `-mprefer-vector-width=512` (and, currently, also setting `-mtune-ctrl=avx512_move_by_pieces`).

For example, if I remove `-mprefer-vector-width=512`, I get

prod(Dual<Dual<double, 7l>, 2l>&, Dual<Dual<double, 7l>, 2l> const&, Dual<Dual<double, 7l>, 2l> const&):
        push    rbp
        mov     eax, -2
        kmovb   k1, eax
        mov     rbp, rsp
        and     rsp, -64
        sub     rsp, 264
        vmovdqa ymm4, YMMWORD PTR [rsi+128]
        vmovapd zmm8, ZMMWORD PTR [rsi]
        vmovapd zmm9, ZMMWORD PTR [rdx]
        vmovdqa ymm6, YMMWORD PTR [rsi+64]
        vmovdqa YMMWORD PTR [rsp+8], ymm4
        vmovdqa ymm4, YMMWORD PTR [rdx+96]
        vbroadcastsd    zmm0, xmm8
        vmovdqa ymm7, YMMWORD PTR [rsi+96]
        vbroadcastsd    zmm1, xmm9
        vmovdqa YMMWORD PTR [rsp-56], ymm6
        vmovdqa ymm5, YMMWORD PTR [rdx+128]
        vmovdqa ymm6, YMMWORD PTR [rsi+160]
        vmovdqa YMMWORD PTR [rsp+168], ymm4
        vxorpd  xmm4, xmm4, xmm4
        vaddpd  zmm0, zmm0, zmm4
        vaddpd  zmm1, zmm1, zmm4
        vmovdqa YMMWORD PTR [rsp-24], ymm7
        vmovdqa ymm7, YMMWORD PTR [rdx+64]
        vmovapd zmm3, ZMMWORD PTR [rsp-56]
        vmovdqa YMMWORD PTR [rsp+40], ymm6
        vmovdqa ymm6, YMMWORD PTR [rdx+160]
        vmovdqa YMMWORD PTR [rsp+200], ymm5
        vmulpd  zmm2, zmm0, zmm9
        vmovdqa YMMWORD PTR [rsp+136], ymm7
        vmulpd  zmm5, zmm1, zmm3
        vbroadcastsd    zmm3, xmm3
        vmovdqa YMMWORD PTR [rsp+232], ymm6
        vaddpd  zmm3, zmm3, zmm4
        vmovapd zmm7, zmm2
        vmovapd zmm2, ZMMWORD PTR [rsp+8]
        vfmadd231pd     zmm7{k1}, zmm8, zmm1
        vmovapd zmm6, zmm5
        vmovapd zmm5, ZMMWORD PTR [rsp+136]
        vmulpd  zmm1, zmm1, zmm2
        vfmadd231pd     zmm6{k1}, zmm9, zmm3
        vbroadcastsd    zmm2, xmm2
        vmovapd zmm3, ZMMWORD PTR [rsp+200]
        vaddpd  zmm2, zmm2, zmm4
        vmovapd ZMMWORD PTR [rdi], zmm7
        vfmadd231pd     zmm1{k1}, zmm9, zmm2
        vmulpd  zmm2, zmm0, zmm5
        vbroadcastsd    zmm5, xmm5
        vmulpd  zmm0, zmm0, zmm3
        vbroadcastsd    zmm3, xmm3
        vaddpd  zmm5, zmm5, zmm4
        vaddpd  zmm3, zmm3, zmm4
        vfmadd231pd     zmm2{k1}, zmm8, zmm5
        vfmadd231pd     zmm0{k1}, zmm8, zmm3
        vaddpd  zmm2, zmm2, zmm6
        vaddpd  zmm0, zmm0, zmm1
        vmovapd ZMMWORD PTR [rdi+64], zmm2
        vmovapd ZMMWORD PTR [rdi+128], zmm0
        vzeroupper
        leave
        ret
prod(Dual<Dual<double, 8l>, 2l>&, Dual<Dual<double, 8l>, 2l> const&, Dual<Dual<double, 8l>, 2l> const&):
        push    rbp
        mov     rbp, rsp
        and     rsp, -64
        sub     rsp, 648
        vmovdqa ymm5, YMMWORD PTR [rsi+224]
        vmovdqa ymm3, YMMWORD PTR [rsi+352]
        vmovapd zmm0, ZMMWORD PTR [rdx+64]
        vmovdqa ymm2, YMMWORD PTR [rsi+320]
        vmovdqa YMMWORD PTR [rsp+104], ymm5
        vmovdqa ymm5, YMMWORD PTR [rdx+224]
        vmovdqa ymm7, YMMWORD PTR [rsi+128]
        vmovdqa YMMWORD PTR [rsp+232], ymm3
        vmovsd  xmm3, QWORD PTR [rsi]
        vmovdqa ymm6, YMMWORD PTR [rsi+192]
        vmovdqa YMMWORD PTR [rsp+488], ymm5
        vmovdqa ymm4, YMMWORD PTR [rdx+192]
        vmovapd zmm1, ZMMWORD PTR [rsi+64]
        vbroadcastsd    zmm5, xmm3
        vmovdqa YMMWORD PTR [rsp+200], ymm2
        vmovdqa ymm2, YMMWORD PTR [rdx+320]
        vmulpd  zmm8, zmm5, zmm0
        vmovdqa YMMWORD PTR [rsp+8], ymm7
        vmovdqa ymm7, YMMWORD PTR [rsi+256]
        vmovdqa YMMWORD PTR [rsp+72], ymm6
        vmovdqa ymm6, YMMWORD PTR [rdx+128]
        vmovdqa YMMWORD PTR [rsp+584], ymm2
        vmovsd  xmm2, QWORD PTR [rdx]
        vmovdqa YMMWORD PTR [rsp+136], ymm7
        vmovdqa ymm7, YMMWORD PTR [rdx+256]
        vmovdqa YMMWORD PTR [rsp+392], ymm6
        vmovdqa ymm6, YMMWORD PTR [rdx+352]
        vmulsd  xmm10, xmm3, xmm2
        vmovdqa YMMWORD PTR [rsp+456], ymm4
        vbroadcastsd    zmm4, xmm2
        vfmadd231pd     zmm8, zmm4, zmm1
        vmovdqa YMMWORD PTR [rsp+520], ymm7
        vmovdqa YMMWORD PTR [rsp+616], ymm6
        vmulpd  zmm9, zmm4, ZMMWORD PTR [rsp+72]
        vmovsd  xmm6, QWORD PTR [rsp+520]
        vmulpd  zmm4, zmm4, ZMMWORD PTR [rsp+200]
        vmulpd  zmm11, zmm5, ZMMWORD PTR [rsp+456]
        vmovsd  QWORD PTR [rdi], xmm10
        vmulpd  zmm5, zmm5, ZMMWORD PTR [rsp+584]
        vmovapd ZMMWORD PTR [rdi+64], zmm8
        vfmadd231pd     zmm9, zmm0, QWORD PTR [rsp+8]{1to8}
        vfmadd231pd     zmm4, zmm0, QWORD PTR [rsp+136]{1to8}
        vmovsd  xmm0, QWORD PTR [rsp+392]
        vmulsd  xmm7, xmm3, xmm0
        vbroadcastsd    zmm0, xmm0
        vmulsd  xmm3, xmm3, xmm6
        vfmadd132pd     zmm0, zmm11, zmm1
        vbroadcastsd    zmm6, xmm6
        vfmadd132pd     zmm1, zmm5, zmm6
        vfmadd231sd     xmm7, xmm2, QWORD PTR [rsp+8]
        vfmadd132sd     xmm2, xmm3, QWORD PTR [rsp+136]
        vaddpd  zmm0, zmm0, zmm9
        vaddpd  zmm1, zmm1, zmm4
        vmovapd ZMMWORD PTR [rdi+192], zmm0
        vmovsd  QWORD PTR [rdi+128], xmm7
        vmovsd  QWORD PTR [rdi+256], xmm2
        vmovapd ZMMWORD PTR [rdi+320], zmm1
        vzeroupper
        leave
        ret

GCC respects the vector builtins and uses 512-bit ops for the arithmetic, but it then splits the data into 256-bit pieces and spills it across the function boundary (the ymm copies through the stack above). So, what I'm arguing is: while it would be great for GCC to respect `-mprefer-vector-width=512`, it should ideally also respect vector builtins/intrinsics, so that one can use full-width vectors without also having to set `-mprefer-vector-width=512 -mtune-ctrl=avx512_move_by_pieces`.
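
For reference, here is a minimal sketch of the kind of code I mean. It is not the exact testcase attached to this PR; the type `v8df`, the struct `DualV`, and the function `dual_mul` are just illustrative names, and the codegen comments describe what the dump above shows rather than anything verified for this exact snippet:

// 64-byte vector via GCC's vector_size extension: explicitly 8 doubles wide,
// independent of -mprefer-vector-width.
typedef double v8df __attribute__((vector_size(64)));

// Illustrative stand-in for the Dual-number structs in the dump above:
// a scalar value plus a full 512-bit block of partial derivatives.
struct DualV {
    double value;
    v8df   grad;
};

// The arithmetic on v8df is explicitly 512 bits wide, so GCC emits zmm
// multiplies/FMAs for it even without -mprefer-vector-width=512.  The
// complaint is about moving the aggregates themselves: in the dump above
// those copies go through the stack in 256-bit ymm pieces unless
// -mtune-ctrl=avx512_move_by_pieces is also given.
static DualV dual_mul(const DualV &a, const DualV &b) {
    DualV r;
    r.value = a.value * b.value;
    // Product rule; GCC's vector extensions allow vector*scalar, which
    // broadcasts the scalar across the vector.
    r.grad = a.grad * b.value + b.grad * a.value;
    return r;
}

Compiled with something like `g++ -O3 -march=skylake-avx512` and no prefer-vector-width flag, the expectation is that the FMA itself stays at 512 bits; it is the aggregate moves/spills that fall back to 256-bit pieces, which is the part I'd like to see follow the width of the builtin vector type.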