[Bug fortran/93734] New: Invalid code generated with -O2 -march=haswell -ftree-vectorize
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93734

            Bug ID: 93734
           Summary: Invalid code generated with -O2 -march=haswell -ftree-vectorize
           Product: gcc
           Version: 8.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: fortran
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bartoldeman at users dot sourceforge.net
  Target Milestone: ---

Created attachment 47837
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47837&action=edit
Fortran code that prints 0 if correct, and -9 if miscompiled

The attached code prints -9. if compiled using

    gfortran -O2 -march=haswell -ftree-vectorize bug.f90 -o bug
    ./bug
     -9.

using

    GNU Fortran (Debian 8.3.0-6) 8.3.0
    Copyright (C) 2018 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions. There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

It is also reproducible with GCC 9.2.0, but not with GCC 7.3.0 and earlier.

The correct answer is 1-1=0.

(I first found this issue when compiling the reference BLAS with those options and running the "zblat2" tests; the test case is a much reduced version of ztrsv, see http://www.netlib.org/lapack/explore-html/dc/dc1/group__complex16__blas__level2_ga99cc66f0833474d6607e6ea7dbe2f9bd.html#ga99cc66f0833474d6607e6ea7dbe2f9bd)
[Bug target/52838] New: [x32] missed optimization for pointer return value
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52838

             Bug #: 52838
           Summary: [x32] missed optimization for pointer return value
    Classification: Unclassified
           Product: gcc
           Version: 4.7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: bartolde...@users.sourceforge.net

The program test.c:

    extern void *foo(void);
    extern void bar(void*);
    void test(void)
    {
      bar(foo());
    }

when compiled with gcc-4.7 -mx32 -Os -S test.c produces:

            .file   "test.c"
            .text
            .globl  test
            .type   test, @function
    test:
    .LFB0:
            .cfi_startproc
            pushq   %rax
            .cfi_def_cfa_offset 16
            call    foo
            popq    %rdx
            .cfi_def_cfa_offset 8
            movq    %rax, %rdi
            jmp     bar
            .cfi_endproc
    .LFE0:
            .size   test, .-test
            .ident  "GCC: (Debian 4.7.0-1) 4.7.0"
            .section        .note.GNU-stack,"",@progbits

Here "movq %rax, %rdi" could be replaced by "movl %eax, %edi", saving one prefix byte 0x48.
[Bug tree-optimization/107254] New: Wrong vectorizer code (GCC 11 only, Fortran)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107254

            Bug ID: 107254
           Summary: Wrong vectorizer code (GCC 11 only, Fortran)
           Product: gcc
           Version: 11.3.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bartoldeman at users dot sourceforge.net
  Target Milestone: ---

Created attachment 53703
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53703&action=edit
Test case

The following code gives the wrong result (-1. instead of 0.) with gfortran 11.3 (also tested with the 11.3.1 20221007 prerelease) when given the options `-O2 -ftree-vectorize -march=core-avx2` for x86_64.

There's no issue with GCC 9, 10, and 12. It could be related to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107212 except that bug also affects GCC 12.

This issue came up from testing the reference LAPACK with -ftree-vectorize enabled, where many more tests failed with recent GCC (11/12), see https://github.com/easybuilders/easybuild-easyconfigs/issues/16380

    $ gfortran -O2 -ftree-vectorize -march=core-avx2 dhgeqz2.f90; ./a.out
      -1.
    $ gfortran -Wall -O2 dhgeqz2.f90; ./a.out
       0.

    subroutine dlartg( f, g, s, r )
      implicit none
      double precision :: f, g, r, s
      double precision :: d, p

      d = sqrt( f*f + g*g )
      p = 1.d0 / d
      if( abs( f ) > 1 ) then
         s = g*sign( p, f )
         r = sign( d, f )
      else
         s = g*sign( p, f )
         r = sign( d, f )
      end if
    end subroutine

    subroutine dhgeqz( n, h, t )
      implicit none
      integer n
      double precision h( n, * ), t( n, * )
      integer jc
      double precision c, s, temp, temp2, tempr

      temp2 = 10d0
      call dlartg( 10d0, temp2, s, tempr )

      c = 0.9d0
      s = 1.d0

      do jc = 1, n
         temp = c*h( 1, jc ) + s*h( 2, jc )
         h( 2, jc ) = -s*h( 1, jc ) + c*h( 2, jc )
         h( 1, jc ) = temp
         temp2 = c*t( 1, jc ) + s*t( 2, jc )
         ! t(2,2)=-s*t(1,2)+c*t(2,2)=-1*0+0.9*0=0
         t( 2, jc ) = -s*t( 1, jc ) + c*t( 2, jc )
         t( 1, jc ) = temp2
      enddo
    end subroutine dhgeqz

    program test
      implicit none
      double precision h(2,2), t(2,2)
      h = 0
      t(1,1) = 1
      t(2,1) = 0
      t(1,2) = 0
      t(2,2) = 0
      call dhgeqz( 2, h, t )
      print *,t(2,2)
    end program test
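As a cross-check of the expected value, the rotation loop in dhgeqz can be restated as a scalar C sketch. This is only an illustration: the names `rot_rows` and `expected_t22` and the zero-based column-major indexing are assumptions of the sketch, not part of the bug report or of LAPACK.

```c
#include <stddef.h>

/* Hypothetical helper (not from the report): applies the same plane
   rotation as the loop in dhgeqz to rows 1 and 2 of a column-major
   matrix with leading dimension ld and n columns. */
void rot_rows(size_t n, size_t ld, double *a, double c, double s)
{
    for (size_t jc = 0; jc < n; jc++) {
        double a1 = a[0 + jc * ld];   /* a(1, jc) */
        double a2 = a[1 + jc * ld];   /* a(2, jc) */
        a[0 + jc * ld] = c * a1 + s * a2;
        a[1 + jc * ld] = -s * a1 + c * a2;
    }
}

/* Mirrors "program test": returns t(2,2) after the rotation. */
double expected_t22(void)
{
    double t[4] = { 1.0, 0.0,    /* column 1: t(1,1), t(2,1) */
                    0.0, 0.0 };  /* column 2: t(1,2), t(2,2) */
    rot_rows(2, 2, t, 0.9, 1.0);
    return t[3];
}
```

Both inputs of the second column are zero, so `expected_t22()` evaluates to 0, matching the unvectorized gfortran output above.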
[Bug tree-optimization/107254] [11/12 Regression] Wrong vectorizer code (Fortran) since r11-1501-gda2b7c7f0a136b4d
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107254 --- Comment #10 from bartoldeman at users dot sourceforge.net --- Thanks for the fix! I can confirm that, when applied to 11.3 (with files renamed from .cc to .c), it fixes the issue, and with it, thousands of test failures in the reference LAPACK test suite. My findings for LAPACK are in this issue here: https://github.com/Reference-LAPACK/lapack/issues/732
[Bug fortran/107294] New: Missed optimization: multiplying real with complex number in Fortran (only)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107294

            Bug ID: 107294
           Summary: Missed optimization: multiplying real with complex number in Fortran (only)
           Product: gcc
           Version: 11.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: fortran
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bartoldeman at users dot sourceforge.net
  Target Milestone: ---

This code:

    complex function csmul(a, b)
      real, value :: a
      complex, value :: b
      csmul = a * b
    end function csmul

produces this assembly on x86-64 (11.3, -O2)

       0:   66 0f d6 4c 24 f8       movq   %xmm1,-0x8(%rsp)
       6:   f3 0f 10 64 24 fc       movss  -0x4(%rsp),%xmm4
       c:   f3 0f 10 4c 24 f8       movss  -0x8(%rsp),%xmm1
      12:   0f 28 d0                movaps %xmm0,%xmm2
      15:   66 0f ef db             pxor   %xmm3,%xmm3       # xmm3 = 0
      19:   f3 0f 59 d1             mulss  %xmm1,%xmm2
      1d:   0f 28 ec                movaps %xmm4,%xmm5
      20:   f3 0f 59 eb             mulss  %xmm3,%xmm5       # xmm5 = 0
      24:   f3 0f 59 c4             mulss  %xmm4,%xmm0
      28:   f3 0f 59 cb             mulss  %xmm3,%xmm1       # xmm1 = 0
      2c:   f3 0f 5c d5             subss  %xmm5,%xmm2       # xmm2 unchanged
      30:   f3 0f 58 c1             addss  %xmm1,%xmm0       # xmm0 unchanged
      34:   f3 0f 11 54 24 f0       movss  %xmm2,-0x10(%rsp)
      3a:   f3 0f 11 44 24 f4       movss  %xmm0,-0xc(%rsp)
      40:   f3 0f 7e 44 24 f0       movq   -0x10(%rsp),%xmm0
      46:   c3                      retq

Here xmm3 (imaginary part of a, promoted to complex) is set to 0 but this is not exploited in the remainder.

On the other hand the assembly for the corresponding C code looks good, with two mul instructions, as expected:

    float _Complex csmul(float a, float _Complex b)
    {
      return a * b;
    }

       0:   66 0f d6 4c 24 f8       movq   %xmm1,-0x8(%rsp)
       6:   f3 0f 10 4c 24 f8       movss  -0x8(%rsp),%xmm1
       c:   f3 0f 59 c8             mulss  %xmm0,%xmm1
      10:   f3 0f 59 44 24 fc       mulss  -0x4(%rsp),%xmm0
      16:   f3 0f 11 4c 24 f0       movss  %xmm1,-0x10(%rsp)
      1c:   f3 0f 11 44 24 f4       movss  %xmm0,-0xc(%rsp)
      22:   f3 0f 7e 44 24 f0       movq   -0x10(%rsp),%xmm0
      28:   c3                      retq

The same issue is still present in trunk, according to godbolt.org.
[Bug fortran/107294] Missed optimization: multiplying real with complex number in Fortran (only)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107294

bartoldeman at users dot sourceforge.net changed:

           What    |Removed                     |Added
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #4 from bartoldeman at users dot sourceforge.net ---
Thanks for the explanation. Constructing an example with NaNs, you get

    0.0 * (NaN + 0.0i) = NaN + 0.0i

for C with Annex G.5.1, but NaN + NaN i for Fortran, unless you specify -fno-signed-zeros.

    program main
      use, intrinsic :: ieee_arithmetic, only: IEEE_Value, IEEE_QUIET_NAN
      use, intrinsic :: iso_fortran_env, only: real32
      real(real32) :: zero, nan
      complex(real32) :: cnan
      nan = IEEE_VALUE(nan, IEEE_QUIET_NAN)
      cnan = cmplx(nan, 0.0)
      zero = 0.0
      print *, zero, cnan, zero * cnan
    end

illustrates this:

    0. ( NaN, 0.) ( NaN, NaN)

vs

    0. ( NaN, 0.) ( NaN, 0.)
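The difference between the two results can be reproduced by hand in C without relying on any compiler's complex arithmetic. The sketch below is an illustration only: `struct cpair` and the three function names are hypothetical, and it assumes default IEEE semantics (no -ffast-math).

```c
#include <math.h>

/* Hypothetical two-float pair standing in for a complex number. */
struct cpair { float re, im; };

/* Real operand promoted to (a + 0i), followed by the full complex
   product formula: four multiplies, one subtract, one add. */
struct cpair mul_promoted(float a, struct cpair b)
{
    const float a_im = 0.0f;
    struct cpair r;
    r.re = a * b.re - a_im * b.im;
    r.im = a * b.im + a_im * b.re;   /* 0 * NaN contributes a NaN here */
    return r;
}

/* Real operand kept real: each component is scaled separately,
   which needs only two multiplies. */
struct cpair mul_componentwise(float a, struct cpair b)
{
    struct cpair r;
    r.re = a * b.re;
    r.im = a * b.im;
    return r;
}

/* Returns 1 if the two variants disagree for 0 * (NaN + 0i). */
int results_differ(void)
{
    struct cpair b = { NAN, 0.0f };
    struct cpair p = mul_promoted(0.0f, b);
    struct cpair q = mul_componentwise(0.0f, b);
    return isnan(p.im) && !isnan(q.im);
}
```

With the promoted form the imaginary part picks up `0 * NaN`, so the result is (NaN, NaN); the component-wise form gives (NaN, 0). That is why the four-multiply sequence cannot simply be reduced to two multiplies when NaNs have to be honored.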
[Bug fortran/107294] Missed optimization: multiplying real with complex number in Fortran (only)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107294

bartoldeman at users dot sourceforge.net changed:

           What    |Removed                     |Added
         Resolution|FIXED                       |WONTFIX
[Bug fortran/103023] New: ICE (Segmentation fault) with !$OMP DECLARE SIMD(func) linear(ref(u))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103023

            Bug ID: 103023
           Summary: ICE (Segmentation fault) with !$OMP DECLARE SIMD(func) linear(ref(u))
           Product: gcc
           Version: 11.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: fortran
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bartoldeman at users dot sourceforge.net
  Target Milestone: ---

Created attachment 51717
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51717&action=edit
Test case for crash

For the following Fortran code gfortran gives a SIGSEGV (tested GCC 9.3, 10.3 locally, 11.2 and trunk on Godbolt)

    subroutine func(u,ndim)
      !$OMP DECLARE SIMD(func) linear(ref(u))
      integer, intent(in) :: ndim
      double precision, intent(in) :: u(ndim)
    end subroutine func

Here's the output for 10.3:

    $ gfortran -c -fopenmp-simd openfun2.f90
    openfun2.f90:1:15:

        1 | subroutine func(u,ndim)
          |               1
    internal compiler error: Segmentation fault
    0xc147cf crash_signal
            ../../gcc/toplev.c:328
    0x948ae6 size_binop_loc(unsigned int, tree_code, tree_node*, tree_node*)
            ../../gcc/fold-const.c:1906
    0x7b8258 gfc_trans_omp_clauses
            ../../gcc/fortran/trans-openmp.c:2324
    0x7bb168 gfc_trans_omp_declare_simd(gfc_namespace*)
            ../../gcc/fortran/trans-openmp.c:5838
    0x77b767 gfc_create_function_decl(gfc_namespace*, bool)
            ../../gcc/fortran/trans-decl.c:3069
    0x77b767 gfc_generate_function_code(gfc_namespace*)
            ../../gcc/fortran/trans-decl.c:6744
    0x6f679e translate_all_program_units
            ../../gcc/fortran/parse.c:6306
    0x6f679e gfc_parse_file()
            ../../gcc/fortran/parse.c:6567
    0x74dfbf gfc_be_parse_file
            ../../gcc/fortran/f95-lang.c:210
    Please submit a full bug report,
    with preprocessed source if appropriate.
    Please include the complete backtrace with any bug report.
    See <https://gcc.gnu.org/bugs/> for instructions.
[Bug fortran/103023] ICE (Segmentation fault) with !$OMP DECLARE SIMD(func) linear(ref(u))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103023

--- Comment #2 from bartoldeman at users dot sourceforge.net ---
Yes, this is mainly about the ICE. It was stripped down from this, which HAS uniform.

    subroutine func(u,f,ndim)
      !$OMP DECLARE SIMD(func) uniform(ndim) linear(ref(f,u):1)
      integer, intent(in) :: ndim
      double precision, intent(in) :: u(ndim)
      double precision, intent(out) :: f(ndim)
      f(1) = u(1) + u(2)
      f(2) = u(1) - u(2)
    end subroutine func

    subroutine main(u,f)
      double precision, intent(in) :: u(8)
      double precision, intent(out) :: f(8)
      !$OMP SIMD
      do i=1,8,2
         call func(u(i),f(i),2)
      enddo
    end subroutine main

If I leave out ndim and hardcode "2" in func (:: u(2) and :: f(2)), or let the auto-vectorizer and inliner do their work, this produces good code (though it would be better with u and f transposed, as basically the code transposes it to two ymm registers in the asm output). With general "ndim" that could still work, e.g. with ndim=3 and 3 equations for u(1:3) -> f(1:3), you'd work with 3 vector registers.

Now you may wonder why "ndim" here, since we know it's "2": this comes from feeding a user-defined function into a larger program (that processes e.g. maps) where that same user needs to specify ndim as a parameter.

Intel (ifort) doesn't like this at all from what I can see:

    openfun.f90(1): error #6080: Only scalar variables may be referenced in a LINEAR or MONOTONIC clause. [U]
    subroutine func(u,f)
    ^
    openfun.f90(1): error #6080: Only scalar variables may be referenced in a LINEAR or MONOTONIC clause. [F]
    subroutine func(u,f)
    --^
    compilation aborted for openfun.f90 (code 1)
[Bug tree-optimization/107451] New: Segmentation fault with vectorized code.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107451

            Bug ID: 107451
           Summary: Segmentation fault with vectorized code.
           Product: gcc
           Version: 11.3.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bartoldeman at users dot sourceforge.net
  Target Milestone: ---

Created attachment 53785
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53785&action=edit
Test case

The following code:

    double dot(int n, const double *x, int inc_x, const double *y)
    {
        int i, ix;
        double dot[4] = { 0.0, 0.0, 0.0, 0.0 } ;

        ix=0;
        for(i = 0; i < n; i++) {
            dot[0] += x[ix] * y[ix] ;
            dot[1] += x[ix+1] * y[ix+1] ;
            dot[2] += x[ix] * y[ix+1] ;
            dot[3] += x[ix+1] * y[ix] ;
            ix += inc_x ;
        }

        return dot[0] + dot[1] + dot[2] + dot[3];
    }

    int main(void)
    {
        double x = 0, y = 0;
        return dot(1, &x, 4096*4096, &y);
    }

crashes with (on Linux x86-64)

    $ gcc -O2 -ftree-vectorize -march=haswell crash.c -o crash
    $ ./crash
    Segmentation fault

for GCC 11.3.0 and also the current prerelease (gcc version 11.3.1 20221021), and also when patched with the patches from https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107254 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107212.

The loop code assembly is as follows:

      18:   c5 f9 10 1e             vmovupd (%rsi),%xmm3
      1c:   c5 f9 10 21             vmovupd (%rcx),%xmm4
      20:   ff c2                   inc    %edx
      22:   c4 e3 65 18 0c 06 01    vinsertf128 $0x1,(%rsi,%rax,1),%ymm3,%ymm1
      29:   c4 e3 5d 18 04 01 01    vinsertf128 $0x1,(%rcx,%rax,1),%ymm4,%ymm0
      30:   48 01 c6                add    %rax,%rsi
      33:   48 01 c1                add    %rax,%rcx
      36:   c4 e3 fd 01 c9 11       vpermpd $0x11,%ymm1,%ymm1
      3c:   c4 e3 fd 01 c0 14       vpermpd $0x14,%ymm0,%ymm0
      42:   c4 e2 f5 b8 d0          vfmadd231pd %ymm0,%ymm1,%ymm2
      47:   39 fa                   cmp    %edi,%edx
      49:   75 cd                   jne    18

What happens here is that the vinsertf128 instructions take the elements from one loop iteration later, and those get put in the high halves of ymm0 and ymm1. The vpermpd instructions then throw away those high halves again, so e.g. they turn 1,2,3,4 into 2,1,2,1 and 1,2,2,1 respectively.

So the result is correct, but the superfluous vinsertf128 instructions access memory potentially past the end of x or y and can thus produce a segfault.

Related issue (coming from OpenBLAS): https://github.com/easybuilders/easybuild-easyconfigs/issues/16387

This may also be related: https://github.com/xianyi/OpenBLAS/issues/3740#issuecomment-1233899834 (the particular comment shows very similar code, but it's for GCC 12, which vectorizes by default; OpenBLAS worked around this by disabling the tree vectorizer there, but only on Mac OS and Windows).
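To make the overshoot concrete, the highest element index touched by each version of the loop can be written down directly. This is a sketch based only on the description above (one extra pair loaded from the following iteration); the function names are hypothetical and n >= 1 is assumed.

```c
/* Highest element index read by the scalar loop in dot(): the last
   iteration touches x[ix] and x[ix+1] with ix = (n-1)*inc_x. */
long long last_index_scalar(long long n, long long inc_x)
{
    return (n - 1) * inc_x + 1;
}

/* Highest element index read by the vectorized loop as described in
   the report: every iteration also loads the pair that belongs to the
   next iteration, i.e. x[ix+inc_x] and x[ix+inc_x+1]. */
long long last_index_vectorized(long long n, long long inc_x)
{
    return n * inc_x + 1;
}
```

For the call in the test case (n = 1, inc_x = 4096*4096) the scalar loop stops at index 1, while the vectorized loop reads up to index 16777217, which is 128 MiB beyond the last element the source code asks for.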
[Bug tree-optimization/107451] [11/12/13 Regression] Segmentation fault with vectorized code.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107451

bartoldeman at users dot sourceforge.net changed:

           What    |Removed                     |Added
  Attachment #53785|0                           |1
        is obsolete|                            |

--- Comment #3 from bartoldeman at users dot sourceforge.net ---
Created attachment 53786
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53786&action=edit
Corrected test case

In my eagerness to make it as short as possible I made it too short indeed!
[Bug tree-optimization/107647] New: GCC 12.2.0 may produce FMAs even with -ffp-contract=off
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107647

            Bug ID: 107647
           Summary: GCC 12.2.0 may produce FMAs even with -ffp-contract=off
           Product: gcc
           Version: 12.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bartoldeman at users dot sourceforge.net
  Target Milestone: ---

I stumbled upon an example where GCC generates an FMA instruction even when FMAs are disabled using -ffp-contract=off (extracted from https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_64/cscal.c)

    $ cat cscal.c
    void cscal(int n, float da_r, float *x)
    {
      for (int i = 0; i < n; i += 4)
        {
          float temp0 = da_r * x[i] - x[i+1];
          float temp1 = da_r * x[i+2] - x[i+3];
          x[i+1] = da_r * x[i+1] + x[i];
          x[i+3] = da_r * x[i+3] + x[i+2];
          x[i] = temp0;
          x[i+2] = temp1;
        }
    }
    $ gcc -S -march=haswell -O2 -ffp-contract=off cscal.c
    $ grep fma cscal.s
            vfmaddsub231ps  %xmm0, %xmm2, %xmm1

I would expect there to be no FMA instructions in there.
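Why an unexpected FMA matters can be shown with a small C sketch in which the fused and unfused forms of `a * b - c` give different float results. The function names are hypothetical. The "fused" variant is only a model: it relies on the product of two floats being exact in double, and it does not call a real FMA instruction or `fmaf`. The `volatile` store is an assumption of this sketch to force the intermediate rounding.

```c
/* Multiply, round the product to float, then subtract: the two
   separate roundings that -ffp-contract=off is meant to preserve.
   The volatile store keeps the compiler from contracting a*b - c. */
float mul_sub_unfused(float a, float b, float c)
{
    volatile float p = a * b;
    return p - c;
}

/* Model of a fused operation: the product of two floats is exact in
   double, so the product itself is not rounded before the subtract. */
float mul_sub_fused_model(float a, float b, float c)
{
    return (float)((double)a * (double)b - (double)c);
}
```

With a = b = 0x1.001p0f (1 + 2^-12) and c = 0x1.002p0f (1 + 2^-11), the exact product is 1 + 2^-11 + 2^-24. Rounded to float this becomes 1 + 2^-11, so the unfused result is 0, whereas the fused model keeps the low bit and returns 2^-24. Code that depends on one of these behaviours changes its results when the compiler silently picks the other.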
[Bug tree-optimization/107647] GCC 12.2.0 may produce FMAs even with -ffp-contract=off
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107647 --- Comment #1 from bartoldeman at users dot sourceforge.net --- According to godbolt it's still producing FMAs on trunk: https://godbolt.org/z/aWh6d1E4E
[Bug tree-optimization/107451] [11/12/13 Regression] Segmentation fault with vectorized code since r11-6434
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107451

--- Comment #9 from bartoldeman at users dot sourceforge.net ---
I ended up using -mprefer-vector-width=128 as a workaround myself (via __attribute__((target("prefer-vector-width=128")))), so there is still some AVX vectorization.
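A minimal sketch of how such a per-function workaround might look when applied to the dot() kernel from the original report. The macro name `PREFER_128` and the compiler/target guard are assumptions of this sketch (the attribute is only applied for GCC 8 or later on x86-64 and expands to nothing elsewhere); the accumulator array is renamed to `acc` for readability.

```c
/* Hypothetical guard: apply the attribute only where GCC documents it. */
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 8 && defined(__x86_64__)
#define PREFER_128 __attribute__((target("prefer-vector-width=128")))
#else
#define PREFER_128
#endif

/* Same computation as dot() in the bug report, limited to 128-bit
   vectors for this one function only. */
PREFER_128
double dot(int n, const double *x, int inc_x, const double *y)
{
    double acc[4] = { 0.0, 0.0, 0.0, 0.0 };
    int ix = 0;

    for (int i = 0; i < n; i++) {
        acc[0] += x[ix]     * y[ix];
        acc[1] += x[ix + 1] * y[ix + 1];
        acc[2] += x[ix]     * y[ix + 1];
        acc[3] += x[ix + 1] * y[ix];
        ix += inc_x;
    }
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Restricting the attribute to the affected kernel leaves the rest of the translation unit free to use the wider vectors selected by -march.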
[Bug target/101683] New: Floating point exception for double->unsigned conversion on avx512 only
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101683

            Bug ID: 101683
           Summary: Floating point exception for double->unsigned conversion on avx512 only
           Product: gcc
           Version: 10.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bartoldeman at users dot sourceforge.net
  Target Milestone: ---

Created attachment 51222
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51222&action=edit
File to reproduce

For this code:

    #define _GNU_SOURCE
    #include <fenv.h>
    int main(int argc, char **argv)
    {
      feenableexcept(FE_INVALID);
      double argcm10 = argc / -0.1;
      return (unsigned)(argcm10 < 0.0 ? 0 : argcm10);
    }

    $ gcc -O -march=skylake-avx512 fpexcept.c -lm
    $ ./a.out
    Floating point exception

the instructions

            vcvttsd2usi     %xmm0, %eax
            vxorpd  %xmm1, %xmm1, %xmm1
            vucomisd        %xmm0, %xmm1
            movl    $0, %edx
            cmova   %edx, %eax

are generated just after the division, so the conversion happens before the comparison.

"If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format."

So when the exception is masked, for argcm10 = -10.0 the value 2^w-1 is discarded because argcm10 < 0 and all is well, but not when it is unmasked.

I can reproduce this issue with 9.3 as well, but not with 8.4 (the generated code is correct for 8.4). I have not tried 11.1 yet.

Note: I found this issue with the UCX library when compiled with -march=skylake-avx512; this example is stripped down from: https://github.com/openucx/ucx/blob/f5362f5e6f80d930b88c44c63b4d8d71cf91d214/src/ucp/core/ucp_ep.c#L2699
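The ordering that the source expression asks for can be spelled out as a small C sketch. This only restates the intended semantics (compare first, convert only non-negative values); `clamp_to_unsigned` is a hypothetical name, and writing the code this way is not claimed to be a workaround, since the compiler may still turn the branch into a conditional move.

```c
/* Source-level intent of (unsigned)(v < 0.0 ? 0 : v): the
   double -> unsigned conversion is evaluated only for values that
   passed the comparison, so a negative input never reaches it. */
unsigned clamp_to_unsigned(double v)
{
    if (v < 0.0)
        return 0u;
    return (unsigned)v;
}
```

In the generated AVX-512 code quoted above, the conversion runs unconditionally and its result is merely discarded afterwards, which is indistinguishable from this sketch only as long as FE_INVALID stays masked.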
[Bug rtl-optimization/101683] Floating point exception for double->unsigned conversion on avx512 only
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101683 --- Comment #6 from bartoldeman at users dot sourceforge.net --- "really not many people care about floating point exceptions". I think more people should :) but this is indeed the context. We found this issue on a supercomputer running OpenFOAM (which can enable FP exceptions, see https://cpp.openfoam.org/v3/a02284.html), and a small simple MPI program with FP exceptions enabled. Even then it crashed in an underlying library, and not OpenFOAM itself, see https://github.com/ComputeCanada/software-stack/issues/74 In the end the combination of MPI and FP exceptions easily triggers it, but the vast majority of jobs don't crash, so even on our cluster this is very rare indeed. And many other clusters don't compile the UCX library with avx512 optimizations enabled or use precompiled binaries without those enabled.