https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166
Bug ID: 111166 Summary: gcc unnecessarily creates vector operations for packing 32 bit integers into struct (x86_64) Product: gcc Version: 13.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: gnu_bugzilla_gcc at catelyn dot tech Target Milestone: --- Created attachment 55799 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55799&action=edit preprocessed file that triggers the bug, as requested GCC version: gcc version 13.2.1 20230801 (GCC) Target: x86_64-pc-linux-gnu Configured with: /build/gcc/src/gcc/configure --enable-languages=ada,c,c++,d,fortran,go,lto,objc,obj-c++ --enable-bootstrap --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --with-build-config=bootstrap-lto --with-linker-hash-style=gnu --with-system-zlib --enable-__cxa_atexit --enable-cet=auto --enable-checking=release --enable-clocale=gnu --enable-default-pie --enable-default-ssp --enable-gnu-indirect-function --enable-gnu-unique-object --enable-libstdcxx-backtrace --enable-link-serialization=1 --enable-linker-build-id --enable-lto --enable-multilib --enable-plugin --enable-shared --enable-threads=posix --disable-libssp --disable-libstdcxx-pch --disable-werror Command used: gcc -v -save-temps weird_gcc_behaviour.c -o weird_gcc_behaviour.s -S -O3 -mtune=generic -march=x86-64 (same behaviour is observed with -O2) Command gives no output to stdout nor stderr, and returns with exit code 0 When compiling the function `turn_into_struct`, a simple function that packs 4 32 bit unsigned integers arguments into a simple struct holding 4 such integers and passes that along to `do_smth_with_4_u32`, at -O2 or -O3 the generated assembly contains a couple vector operations (`punpckldq` and `punpcklqdq`), as well as spilling onto the stack. This does not seem like a good idea to me, performance wise When compiled at -Os it instead uses `salq`, `movl` (to ensure the upper 32 bits are cleared) and `orq` to pack the data together, avoiding memory altogether, which (intuitively to me) seems like a significantly faster implementation as it doesn't need to touch SSE nor memory