https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #12 from Chris Elrod <elrodc at gmail dot com> ---
Created attachment 45363
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45363&action=edit
Fortran program for running benchmarks.

Okay, thank you. I attached a Fortran program you can run to benchmark the code.
It randomly generates valid inputs, and then times running the code 10^5 times.
Finally, it reports the average time in microseconds.
The SIMD times are the vectorized version, and the transposed times are the
non-vectorized versions. In both cases, Flang produces much faster code.

The results seem in line with what I got benchmarking shared libraries from
Julia.
I linked rt for access to the high resolution clock.

$ gfortran -Ofast -lrt -march=native -mprefer-vector-width=512 vectorization_tests.F90 -o gfortvectests

$ time ./gfortvectests
 Transpose benchmark completed in 22.7799759
 SIMD benchmark completed in 1.34003162
 All are equal: F
 All are approximately equal: F
 Maximum relative error 8.27204276E-05
 First record X: 1.02466011 -0.689792156 -0.404027045
 First record Xt: 1.02465975 -0.689791918 -0.404026985
 Second record X: -0.546353579 3.37308086E-03 1.15257287
 Second record Xt: -0.546353400 3.37312138E-03 1.15257275

real    0m2.418s
user    0m2.412s
sys     0m0.003s

$ flang -Ofast -lrt -march=native -mprefer-vector-width=512 vectorization_tests.F90 -o flangvectests

$ time ./flangvectests
 Transpose benchmark completed in 7.232568
 SIMD benchmark completed in 0.6596010
 All are equal: F
 All are approximately equal: F
 Maximum relative error 2.0917827E-04
 First record X: 0.5867542 1.568364 0.1006735
 First record Xt: 0.5867541 1.568363 0.1006735
 Second record X: 0.2894785 -0.1510675 -9.3419194E-02
 Second record Xt: 0.2894785 -0.1510675 -9.3419187E-02

real    0m0.801s
user    0m0.794s
sys     0m0.005s
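
For readers without the attachment, the procedure described above (time 10^5 runs, report the average in microseconds, then compare results by maximum relative error) can be sketched roughly as below. This is not the code from vectorization_tests.F90. The module and procedure names are hypothetical, the kernel is a placeholder, and the standard system_clock intrinsic is assumed here in place of whatever high resolution clock the attached program reaches through librt.

```fortran
! Sketch only: bench_sketch, placeholder_kernel, average_time_us and
! max_relative_error are hypothetical names, not taken from the attachment.
module bench_sketch
  use iso_fortran_env, only: int64, real32, real64
  implicit none
  private
  public :: placeholder_kernel, average_time_us, max_relative_error

contains

  ! Stand-in for the code under test; the real kernel is in the attachment.
  subroutine placeholder_kernel(x, y)
    real(real32), intent(in)  :: x(:)
    real(real32), intent(out) :: y(:)
    y = 1.0_real32 / sqrt(x)
  end subroutine placeholder_kernel

  ! Run the kernel nrep times and return the mean wall time per call
  ! in microseconds.
  function average_time_us(x, y, nrep) result(avg_us)
    real(real32), intent(in)  :: x(:)
    real(real32), intent(out) :: y(:)
    integer, intent(in)       :: nrep
    real(real64)   :: avg_us
    integer(int64) :: t0, t1, rate
    integer        :: i

    call system_clock(count_rate=rate)   ! clock ticks per second
    call system_clock(count=t0)
    do i = 1, nrep
       call placeholder_kernel(x, y)
    end do
    call system_clock(count=t1)
    avg_us = 1.0e6_real64 * real(t1 - t0, real64) &
             / (real(rate, real64) * real(nrep, real64))
  end function average_time_us

  ! Largest elementwise |a - b| / |b|; assumes no element of b is zero.
  function max_relative_error(a, b) result(err)
    real(real32), intent(in) :: a(:), b(:)
    real(real32) :: err
    err = maxval(abs(a - b) / abs(b))
  end function max_relative_error

end module bench_sketch
```

A driver would fill x with call random_number(x), call average_time_us(x, y, 100000), and pass two result arrays to max_relative_error. An aggressive optimizer may hoist or drop the repeated calls in a loop this simple, so timings from this sketch should be treated with caution.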