https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #14 from Chris Elrod <elrodc at gmail dot com> ---
It's not really reproducible across runs:

$ time ./gfortvectests
 Transpose benchmark completed in   22.7010765
 SIMD benchmark completed in   1.37529969
 All are equal: F
 All are approximately equal: F
 Maximum relative error   6.20566949E-04
 First record X:    0.188879877      0.377619117     -1.67841911E-02
 First record Xt:   0.188880071      0.377619147     -1.67841911E-02
 Second record X:  -8.14126506E-02  -0.421755224     -0.199057430
 Second record Xt: -8.14126655E-02  -0.421755224     -0.199057430

real	0m2.414s
user	0m2.406s
sys	0m0.005s

$ time ./flangvectests
 Transpose benchmark completed in    7.630980
 SIMD benchmark completed in   0.6455200
 All are equal: F
 All are approximately equal: F
 Maximum relative error   2.0917827E-04
 First record X:    0.5867542       1.568364        0.1006735
 First record Xt:   0.5867541       1.568363        0.1006735
 Second record X:   0.2894785      -0.1510675      -9.3419194E-02
 Second record Xt:  0.2894785      -0.1510675      -9.3419187E-02

real	0m0.839s
user	0m0.832s
sys	0m0.006s

$ time ./gfortvectests
 Transpose benchmark completed in   22.0195961
 SIMD benchmark completed in   1.36087596
 All are equal: F
 All are approximately equal: F
 Maximum relative error   2.49150675E-04
 First record X:   -0.284217566      2.13768221E-02  -0.475293010
 First record Xt:  -0.284217596      2.13767942E-02  -0.475293040
 Second record X:   1.75664220E-02  -9.29893106E-02  -4.37139049E-02
 Second record Xt:  1.75664220E-02  -9.29893106E-02  -4.37139049E-02

real	0m2.344s
user	0m2.338s
sys	0m0.003s

$ time ./flangvectests
 Transpose benchmark completed in    7.881181
 SIMD benchmark completed in   0.6132510
 All are equal: F
 All are approximately equal: F
 Maximum relative error   2.0917827E-04
 First record X:    0.5867542       1.568364        0.1006735
 First record Xt:   0.5867541       1.568363        0.1006735
 Second record X:   0.2894785      -0.1510675      -9.3419194E-02
 Second record Xt:  0.2894785      -0.1510675      -9.3419187E-02

real	0m0.861s
user	0m0.853s
sys	0m0.006s

It also probably wasn't quite right to call it "error", because it's comparing the values from the scalar and vectorized versions. Large differences are still unsettling, though; ideally they would match exactly.

Back to Julia, using MPFR (set to 252 bits of precision) and rounding to single precision for a correctly rounded answer, where

X32gfort # calculated from gfortran
X32flang # calculated from flang
Xbf      # MPFR, 252-bit precision ("BigFloat" in Julia)

julia> Xbf32 = Float32.(Xbf) # correctly rounded result

julia> function ULP(x, correct) # counts the ULP error of x
           x == correct && return 0
           if x < correct
               error = 1
               while nextfloat(x, error) != correct
                   error += 1
               end
           else
               error = 1
               while prevfloat(x, error) != correct
                   error += 1
               end
           end
           error
       end
ULP (generic function with 1 method)

julia> ULP.(X32gfort, Xbf32)'
3×1024 Adjoint{Int64,Array{Int64,2}}:
 7 1 1 8 3 2 1 1 1 27 4 1 4 6 0 0 2 0 2 4 0 7 1 1 3 8 4 2 2 … 1 0 2 0 0 1 2 3 1 5 1 1 0 0 0 2 3 2 1 2 3 1 0 1 1 0 2 0 41 4 2 1 1 6 1 0 1 1 2 2 0 0 3 0 1 0 3 1 1 0 1 1 0 0 3 1 0 0 0 1 0 1 0 1 0 1 1 4 1 1 0 2 0 1 0 1 0 0 0 1 2 1 1 1 0 0 1 1 1 0 1 1 0 0 0 0 1 1 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 1 0 0 1 0 1 1 0 1 1 0 0 0 0 0 1

julia> mean(ans)
1.9462890625

julia> ULP.(X32flang, Xbf32)'
3×1024 Adjoint{Int64,Array{Int64,2}}:
 4 1 0 3 0 0 0 1 1 5 2 1 1 6 3 0 1 0 0 1 1 21 0 1 2 8 2 3 0 0 … 1 1 1 15 2 1 1 5 1 1 1 0 0 0 0 0 2 1 3 1 1 1 1 1 1 1 0 11 3 1 1 0 1 0 0 1 0 0 1 0 0 2 1 1 1 6 0 0 0 2 1 0 1 4 1 1 0 3 1 1 1 1 2 1 1 0 1 1 0 0 1 0 1 0 0 1 0 0 1 1 1 0 1 0 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 1 0 0 1 1 0 1 1 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 1 1 1 1 0 1 0 0 0 0 0 0 1

julia> mean(ans)
1.3388671875

So in that case, gfortran's version had about 1.95 ULP of error on average, and Flang's about 1.34 ULP.
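A scalar/vectorized mismatch like the one above is expected whenever the compiler reassociates floating-point operations (e.g. when vectorizing a reduction under fast-math flags), because floating-point addition is not associative. A minimal sketch of the effect in Python, independent of either Fortran compiler (the values are chosen so the rounding is visible in double precision):

```python
# Floating-point addition is not associative: a vectorized loop that
# reorders a reduction can legitimately produce a different result
# than the scalar loop, even though both are "correct".
a = [1e16, 1.0, -1e16, 1.0]

# Scalar left-to-right order: ((1e16 + 1.0) + -1e16) + 1.0.
# 1e16 + 1.0 rounds back to 1e16 (the spacing between adjacent
# doubles near 1e16 is 2.0), so the first 1.0 is lost entirely.
sequential = ((a[0] + a[1]) + a[2]) + a[3]

# Reassociated order, as a 2-lane vectorized reduction might compute
# it: (1e16 + -1e16) + (1.0 + 1.0). Here nothing is lost.
reassociated = (a[0] + a[2]) + (a[1] + a[3])

print(sequential)    # 1.0
print(reassociated)  # 2.0
```

Neither answer is wrong in isolation; they are two valid rounding sequences for the same exact sum, which is why comparing the two versions measures divergence rather than true error.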
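For readers without Julia at hand, the same ULP count can be reproduced without a search loop by reinterpreting the float32 bit patterns as integers; consecutive integers then correspond to consecutive representable floats. A sketch using only the Python standard library (the helper names are mine, not from the test program):

```python
import struct

def ordered_int32(f):
    """Map a value, rounded to float32, to an integer such that
    consecutive integers correspond to consecutive representable
    float32 values."""
    (i,) = struct.unpack("<i", struct.pack("<f", f))
    # Negative floats sort in reverse when their bits are read as a
    # signed int; flip that range so the mapping is monotone
    # (-0.0 and +0.0 both map to 0).
    return i if i >= 0 else -2147483648 - i

def ulp_distance(x, correct):
    """Number of representable float32 values between x and the
    correctly rounded reference -- the same quantity the Julia ULP
    function counts with nextfloat/prevfloat."""
    return abs(ordered_int32(x) - ordered_int32(correct))

# 1.0f0 has bit pattern 0x3F800000; three ULPs above it is 0x3F800003.
three_up = struct.unpack("<f", struct.pack("<i", 0x3F800000 + 3))[0]
print(ulp_distance(1.0, three_up))  # 3
```

Averaging `ulp_distance` over the two result arrays would give the same ~1.95 vs ~1.34 ULP comparison as the Julia session.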