https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111166

            Bug ID: 111166
           Summary: gcc unnecessarily creates vector operations for
                    packing 32 bit integers into struct (x86_64)
           Product: gcc
           Version: 13.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gnu_bugzilla_gcc at catelyn dot tech
  Target Milestone: ---

Created attachment 55799
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55799&action=edit
preprocessed file that triggers the bug, as requested

GCC version: gcc version 13.2.1 20230801 (GCC)

Target: x86_64-pc-linux-gnu

Configured with: /build/gcc/src/gcc/configure
--enable-languages=ada,c,c++,d,fortran,go,lto,objc,obj-c++ --enable-bootstrap
--prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man
--infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/
--with-build-config=bootstrap-lto --with-linker-hash-style=gnu
--with-system-zlib --enable-__cxa_atexit --enable-cet=auto
--enable-checking=release --enable-clocale=gnu --enable-default-pie
--enable-default-ssp --enable-gnu-indirect-function --enable-gnu-unique-object
--enable-libstdcxx-backtrace --enable-link-serialization=1
--enable-linker-build-id --enable-lto --enable-multilib --enable-plugin
--enable-shared --enable-threads=posix --disable-libssp --disable-libstdcxx-pch
--disable-werror

Command used: gcc -v -save-temps weird_gcc_behaviour.c -o weird_gcc_behaviour.s
-S -O3 -mtune=generic -march=x86-64
(same behaviour is observed with -O2)

The command gives no output on stdout or stderr and returns with exit code 0.

When compiling the function `turn_into_struct`, a simple function that packs
four 32-bit unsigned integer arguments into a struct holding four such integers
and passes it along to `do_smth_with_4_u32`, the assembly generated at -O2 or
-O3 contains a couple of vector operations (`punpckldq` and `punpcklqdq`) and
also spills to the stack. This does not seem like a good idea to me,
performance-wise.
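
For readers without the attachment, the function described above might look
roughly like the sketch below. This is an assumption, not the attached source:
the struct name `four_u32`, its field names, the return type and the body of
`do_smth_with_4_u32` are all hypothetical. Only the two function names and the
"four 32-bit unsigned integers" shape come from the report.

```c
#include <stdint.h>

/* Hypothetical layout: the report only says the struct holds four
 * 32-bit unsigned integers. */
struct four_u32 {
    uint32_t a, b, c, d;
};

/* Stand-in body so the sketch links and runs on its own. In the report
 * the callee is presumably only declared, not defined, in this file. */
uint32_t do_smth_with_4_u32(struct four_u32 s)
{
    return s.a ^ s.b ^ s.c ^ s.d;
}

/* Packs the four arguments into the struct and passes it by value. */
uint32_t turn_into_struct(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    struct four_u32 s = { a, b, c, d };
    return do_smth_with_4_u32(s);
}
```

To inspect the code generation discussed here, the stand-in definition would
probably need to be replaced by a plain declaration, since a visible body lets
the compiler inline the call and the struct may never be materialised.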

When compiled at -Os, it instead uses `salq`, `movl` (to ensure the upper 32
bits are cleared) and `orq` to pack the data together, avoiding memory
altogether. Intuitively, this seems to me like a significantly faster
implementation, as it touches neither SSE nor memory.

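The integer-only packing described above can be written out by hand in C, as a
rough sketch. It assumes the x86-64 System V calling convention, under which a
16-byte struct of four 32-bit integers is passed in two 64-bit general-purpose
registers, each holding two adjacent fields. The helper name `pack_pair` is
hypothetical and does not appear in the attachment.

```c
#include <stdint.h>

/* Combine two 32-bit values into one 64-bit value using only integer
 * operations: the cast of lo zero-extends it (the role of `movl`),
 * the shift moves hi into the upper half (`salq`), and the bitwise OR
 * merges the two halves (`orq`). */
static inline uint64_t pack_pair(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | (uint64_t)lo;
}
```

Under that assumption, the first register would carry `pack_pair(a, b)` and the
second `pack_pair(c, d)`, with no vector register or stack slot involved.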