https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87320
Bug ID: 87320
Summary: Last iteration of vectorized loop not executed when peeling for gaps
Product: gcc
Version: 8.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: kilian.verhetsel at uclouvain dot be
Target Milestone: ---

Created attachment 44699
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44699&action=edit
Program for which GCC generates incorrect code with vectorization enabled

Hello,

r256635 added support for fully masked vectorized loops that require peeling for gaps, in which case it only uses one iteration of the epilogue. However, this introduced an issue where the number of iterations of the epilogue can be computed incorrectly if peeling is not required because of the number of iterations. This causes the last iteration of the loop not to be executed or, if the total number of iterations is equal to the vector size, the loop to be executed until a crash.

I have attached a small C program that illustrates this issue. I would expect it to terminate with no output when run, and this is what happens with GCC 7:

$ gcc-7 crash-vectorization.c -O3 -mavx -o crash-vectorization && ./crash-vectorization # no output
$ gcc-7 -v
Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/7.3.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc7/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,lto --enable-shared --enable-threads=posix --enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --disable-werror --enable-checking=release --enable-default-pie --enable-default-ssp --program-suffix=-7 --enable-version-specific-runtime-libs
Thread model: posix
gcc version 7.3.1 20180814 (GCC)

With GCC 8.2.1, on an x86-64 machine, it crashes because of a segmentation fault:

$ gcc crash-vectorization.c -O3 -mavx -o crash-vectorization && ./crash-vectorization
zsh: segmentation fault (core dumped) ./crash-vectorization
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/8.2.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared --enable-threads=posix --enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --enable-multilib --disable-werror --enable-checking=release --enable-default-pie --enable-default-ssp --enable-cet=auto
Thread model: posix
gcc version 8.2.1 20180831 (GCC)

When the program is compiled with SSE (where the vector size is 2 instead of 4), it no longer crashes, but the output still shows that the last iterations of the loop were not executed:

$ gcc crash-vectorization.c -O3 -msse -o crash-vectorization && ./crash-vectorization
fail: 3
fail: 119

I was able to reproduce this issue as of r263156. I believe the code at fault is found in tree-vect-loop-manip.c:vect_do_peeling:2425:

  poly_uint64 bound_epilog = 0;
  if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
      && LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
    bound_epilog += vf - 1;
  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
    bound_epilog += 1;

where bound_epilog should be set to vf instead of 1 when compiling the second loop of the attached program.

On a side note, looking at the final x86 assembly for this program, it's not clear to me why the vectorized body of the loop could not be used for the last 4 iterations. Is this not a missed optimization?