https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #14 from Chris Elrod <elrodc at gmail dot com> ---
It's not really reproducible across runs:

$ time ./gfortvectests
 Transpose benchmark completed in   22.7010765
 SIMD benchmark completed in   1.37529969
 All are equal: F
 All are approximately equal: F
 Maximum relative error   6.20566949E-04
 First record X:    0.188879877      0.377619117     -1.67841911E-02
 First record Xt:   0.188880071      0.377619147     -1.67841911E-02
 Second record X:  -8.14126506E-02  -0.421755224     -0.199057430
 Second record Xt: -8.14126655E-02  -0.421755224     -0.199057430

real	0m2.414s
user	0m2.406s
sys	0m0.005s

$ time ./flangvectests
 Transpose benchmark completed in    7.630980
 SIMD benchmark completed in   0.6455200
 All are equal: F
 All are approximately equal: F
 Maximum relative error   2.0917827E-04
 First record X:    0.5867542       1.568364        0.1006735
 First record Xt:   0.5867541       1.568363        0.1006735
 Second record X:   0.2894785      -0.1510675      -9.3419194E-02
 Second record Xt:  0.2894785      -0.1510675      -9.3419187E-02

real	0m0.839s
user	0m0.832s
sys	0m0.006s

$ time ./gfortvectests
 Transpose benchmark completed in   22.0195961
 SIMD benchmark completed in   1.36087596
 All are equal: F
 All are approximately equal: F
 Maximum relative error   2.49150675E-04
 First record X:   -0.284217566      2.13768221E-02  -0.475293010
 First record Xt:  -0.284217596      2.13767942E-02  -0.475293040
 Second record X:   1.75664220E-02  -9.29893106E-02  -4.37139049E-02
 Second record Xt:  1.75664220E-02  -9.29893106E-02  -4.37139049E-02

real	0m2.344s
user	0m2.338s
sys	0m0.003s

$ time ./flangvectests
 Transpose benchmark completed in    7.881181
 SIMD benchmark completed in   0.6132510
 All are equal: F
 All are approximately equal: F
 Maximum relative error   2.0917827E-04
 First record X:    0.5867542       1.568364        0.1006735
 First record Xt:   0.5867541       1.568363        0.1006735
 Second record X:   0.2894785      -0.1510675      -9.3419194E-02
 Second record Xt:  0.2894785      -0.1510675      -9.3419187E-02

real	0m0.861s
user	0m0.853s
sys	0m0.006s

It also probably wasn't quite right to call it "error", because it's comparing the values from the scalar and vectorized versions. Large differences are still unsettling, though; ideally they would match exactly.

Back to Julia, using MPFR (set to 252 bits of precision) and rounding to single precision for a correctly rounded answer, where

X32gfort # calculated from gfortran
X32flang # calculated from flang
Xbf      # MPFR, 252-bit precision ("BigFloat" in Julia)

julia> Xbf32 = Float32.(Xbf) # correctly rounded result

julia> function ULP(x, correct) # counts the ULP error of x
           x == correct && return 0
           if x < correct
               error = 1
               while nextfloat(x, error) != correct
                   error += 1
               end
           else
               error = 1
               while prevfloat(x, error) != correct
                   error += 1
               end
           end
           error
       end
ULP (generic function with 1 method)

julia> ULP.(X32gfort, Xbf32)'
3×1024 Adjoint{Int64,Array{Int64,2}}:
 7 1 1 8 3 2 1 1 1 27 4 1 4 6 0 0 2 0 2 4 0 7 1 1 3 8 4 2 2 … 1 0 2 0 0 1 2 3 1 5 1 1 0 0 0 2 3 2 1 2 3 1 0 1 1 0 2 0 41 4 2 1 1 6 1 0 1 1 2 2 0 0 3 0 1 0 3 1 1 0 1 1 0 0 3 1 0 0 0 1 0 1 0 1 0 1 1 4 1 1 0 2 0 1 0 1 0 0 0 1 2 1 1 1 0 0 1 1 1 0 1 1 0 0 0 0 1 1 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 1 0 0 1 0 1 1 0 1 1 0 0 0 0 0 1

julia> mean(ans)
1.9462890625

julia> ULP.(X32flang, Xbf32)'
3×1024 Adjoint{Int64,Array{Int64,2}}:
 4 1 0 3 0 0 0 1 1 5 2 1 1 6 3 0 1 0 0 1 1 21 0 1 2 8 2 3 0 0 … 1 1 1 15 2 1 1 5 1 1 1 0 0 0 0 0 2 1 3 1 1 1 1 1 1 1 0 11 3 1 1 0 1 0 0 1 0 0 1 0 0 2 1 1 1 6 0 0 0 2 1 0 1 4 1 1 0 3 1 1 1 1 2 1 1 0 1 1 0 0 1 0 1 0 0 1 0 0 1 1 1 0 1 0 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 1 0 0 1 1 0 1 1 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 1 1 1 1 0 1 0 0 0 0 0 0 1

julia> mean(ans)
1.3388671875

So in that case, gfortran's version had about 1.95 ULP of error on average, and Flang's about 1.34 ULP.
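A scalar/vectorized mismatch like the one above is expected whenever the compiler reassociates floating-point operations (e.g. when vectorizing a reduction under fast-math flags), because floating-point addition is not associative. A minimal sketch of the effect in Python, independent of either Fortran compiler (the values are chosen so the rounding is visible in double precision):

```python
# Floating-point addition is not associative: a vectorized loop that
# reorders a reduction can legitimately produce a different result
# than the scalar loop, even though both are "correct".
a = [1e16, 1.0, -1e16, 1.0]

# Scalar left-to-right order: ((1e16 + 1.0) + -1e16) + 1.0.
# 1e16 + 1.0 rounds back to 1e16 (the spacing between adjacent
# doubles near 1e16 is 2.0), so the first 1.0 is lost entirely.
sequential = ((a[0] + a[1]) + a[2]) + a[3]

# Reassociated order, as a 2-lane vectorized reduction might compute
# it: (1e16 + -1e16) + (1.0 + 1.0). Here nothing is lost.
reassociated = (a[0] + a[2]) + (a[1] + a[3])

print(sequential)    # 1.0
print(reassociated)  # 2.0
```

Neither answer is wrong in isolation; they are two valid rounding sequences for the same exact sum, which is why comparing the two versions measures divergence rather than true error.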
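For readers without Julia at hand, the same ULP count can be reproduced without a search loop by reinterpreting the float32 bit patterns as integers; consecutive integers then correspond to consecutive representable floats. A sketch using only the Python standard library (the helper names are mine, not from the test program):

```python
import struct

def ordered_int32(f):
    """Map a value, rounded to float32, to an integer such that
    consecutive integers correspond to consecutive representable
    float32 values."""
    (i,) = struct.unpack("<i", struct.pack("<f", f))
    # Negative floats sort in reverse when their bits are read as a
    # signed int; flip that range so the mapping is monotone
    # (-0.0 and +0.0 both map to 0).
    return i if i >= 0 else -2147483648 - i

def ulp_distance(x, correct):
    """Number of representable float32 values between x and the
    correctly rounded reference -- the same quantity the Julia ULP
    function counts with nextfloat/prevfloat."""
    return abs(ordered_int32(x) - ordered_int32(correct))

# 1.0f0 has bit pattern 0x3F800000; three ULPs above it is 0x3F800003.
three_up = struct.unpack("<f", struct.pack("<i", 0x3F800000 + 3))[0]
print(ulp_distance(1.0, three_up))  # 3
```

Averaging `ulp_distance` over the two result arrays would give the same ~1.95 vs ~1.34 ULP comparison as the Julia session.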