https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87320

            Bug ID: 87320
           Summary: Last iteration of vectorized loop not executed when
                    peeling for gaps
           Product: gcc
           Version: 8.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: kilian.verhetsel at uclouvain dot be
  Target Milestone: ---

Created attachment 44699
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44699&action=edit
Program for which GCC generates incorrect code with vectorization enabled

Hello,

r256635 added support for fully masked vectorized loops that require peeling
for gaps, in which case only one iteration of the epilogue is used. However,
this introduced an issue where the number of iterations of the epilogue can be
computed incorrectly when the number of iterations does not itself require
peeling. This causes the last iteration of the loop not to be executed or, if
the total number of iterations is equal to the vector size, the loop to keep
executing until it crashes.

I have attached a small C program that illustrates this issue. I would expect
it to terminate with no output when run, and this is what happens with GCC 7:

    $ gcc-7 crash-vectorization.c -O3 -mavx -o crash-vectorization && ./crash-vectorization
    # no output
    $ gcc-7 -v
    Using built-in specs.                                                       
    COLLECT_GCC=gcc-7                                                           
    COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/7.3.1/lto-wrapper
    Target: x86_64-pc-linux-gnu
    Configured with: /build/gcc7/src/gcc/configure --prefix=/usr
--libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man
--infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/
--enable-languages=c,c++,lto --enable-shared --enable-threads=posix
--enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit
--disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch
--disable-libssp --enable-gnu-unique-object --enable-linker-build-id
--enable-lto --enable-plugin --enable-install-libiberty
--with-linker-hash-style=gnu --enable-gnu-indirect-function --disable-werror
--enable-checking=release --enable-default-pie --enable-default-ssp
--program-suffix=-7 --enable-version-specific-runtime-libs
    Thread model: posix
    gcc version 7.3.1 20180814 (GCC)

With GCC 8.2.1, on an x86-64 machine, it crashes with a segmentation
fault:

    $ gcc crash-vectorization.c -O3 -mavx -o crash-vectorization && ./crash-vectorization
    zsh: segmentation fault (core dumped)  ./crash-vectorization
    $ gcc -v
    Using built-in specs.                                                       
    COLLECT_GCC=gcc                                                             
    COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/8.2.1/lto-wrapper
    Target: x86_64-pc-linux-gnu
    Configured with: /build/gcc/src/gcc/configure --prefix=/usr
--libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man
--infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/
--enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared
--enable-threads=posix --enable-libmpx --with-system-zlib --with-isl
--enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu
--disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object
--enable-linker-build-id --enable-lto --enable-plugin
--enable-install-libiberty --with-linker-hash-style=gnu
--enable-gnu-indirect-function --enable-multilib --disable-werror
--enable-checking=release --enable-default-pie --enable-default-ssp
--enable-cet=auto
    Thread model: posix
    gcc version 8.2.1 20180831 (GCC)

When the program is compiled with SSE (where the vector size is 2 instead of
4), it no longer crashes, but the output still shows that the last iterations
of the loop were not executed:

    $ gcc crash-vectorization.c -O3 -msse -o crash-vectorization && ./crash-vectorization
    fail: 3                                                                     
    fail: 119

I was able to reproduce this issue as of r263156.

I believe the code at fault is found in
tree-vect-loop-manip.c:vect_do_peeling:2425:

    poly_uint64 bound_epilog = 0;
    if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
        && LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
      bound_epilog += vf - 1;
    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
      bound_epilog += 1;

where bound_epilog should be set to vf instead of 1 when compiling the second
loop of the attached program.
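
To illustrate the arithmetic behind this claim, here is a small sketch in C. It
is a simplified model and not GCC code: the function names are hypothetical,
and it assumes that the vector loop runs floor((niters - gap) / vf) times, with
gap equal to 1 when peeling for gaps is required, and that the scalar epilogue
runs whatever is left.

```c
#include <assert.h>

/* Hypothetical model, not GCC code: scalar iterations left for the
   epilogue when the vector loop runs (niters - gap) / vf times, where
   gap is 1 if peeling for gaps is required and 0 otherwise.  */
static unsigned
epilogue_iters (unsigned niters, unsigned vf, int peel_for_gaps)
{
  assert (vf >= 1 && niters >= 1);
  unsigned gap = peel_for_gaps ? 1u : 0u;
  unsigned vector_iters = (niters - gap) / vf;
  return niters - vector_iters * vf;
}

/* Largest epilogue iteration count over all niters in [1, max_niters].  */
static unsigned
max_epilogue_iters (unsigned max_niters, unsigned vf, int peel_for_gaps)
{
  unsigned worst = 0;
  for (unsigned n = 1; n <= max_niters; n++)
    {
      unsigned e = epilogue_iters (n, vf, peel_for_gaps);
      if (e > worst)
        worst = e;
    }
  return worst;
}
```

Under these assumptions, whenever niters is a multiple of vf and peeling for
gaps is required, the epilogue has to run vf iterations, so an upper bound of 1
is too small in exactly the case where no peeling for the number of iterations
is needed.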

On a side note, looking at the final x86 assembly for this program, it's not
clear to me why the vectorized body of the loop could not be used for the last
4 iterations. Is this not a missed optimization?
