https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930
Bug ID: 79930
Summary: Potentially Missed Optimisation for MATMUL / DOT_PRODUCT
Product: gcc
Version: 6.3.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: fortran
Assignee: unassigned at gcc dot gnu.org
Reporter: adam at aphirst dot karoo.co.uk
Target Milestone: ---

In my codebase I'm performing many "Tensor Products"; this is by far the hottest routine. It computes something like

    tp = NU^T * P * NV
    [3-vector] = [4-vector]^T * [4x4 "matrix" of 3-vectors] * [4-vector]

I implement this in three different ways (see the sketch at the end of this report):

1) An explicit DO (CONCURRENT) loop over i and j producing a 4x4 result "matrix", whose x, y and z components are then summed separately into a single result 3-vector.
2) Three separate matmul + dot_product calls (one each for x, y and z): dot_product(matmul(NU,P),NV).
3) The same, but with the matmul on the other side: dot_product(NU,matmul(P,NV)).

My code is posted at https://gist.github.com/aphirst/75e0599e2d4b14d182b52daaa6a74098 and, after discussing at length with JerryD and dominiq on IRC, I'd like to summarise our findings.

0) There are two versions of the test code: one where the 3-vector is implemented as a single real, dimension(3) component, the other with three separate %x, %y and %z components. Across all the tests described below the performance difference was almost negligible, on my machine only slightly favouring the dimension(3) implementation.

1) With no optimisation and -fcheck=all, both "Vector" implementations show the "explicit DO" approach to be twice as slow as the matmul approaches. This case is the exception, presumably because -fcheck=all heavily penalises the explicit looping.

2) With no optimisation and no -fcheck, both "Vector" implementations show the "explicit DO" approach to be about 1.5x as fast as one matmul orientation, and very slightly slower than the other.

3) With -Og, regardless of -fcheck, both "Vector" implementations show the "explicit DO" approach to be either twice or 1.5x as fast as the respective matmul orientations. Interestingly, the random number generation now takes about 15% longer than with no optimisation.

4) The same holds for -O2, again regardless of -fcheck, except that the gap between the "explicit DO" and matmul approaches is slightly larger.

To summarise:

* For some reason, either matmul or dot_product is missing some sort of optimisation here. Whether such an optimisation is actually possible isn't for me to say, but JerryD noted that, according to the tree dump, the matmul calls are not being inlined.
* Random number generation surely shouldn't take longer with optimisation enabled than without, should it?

---

I'm running on Arch Linux (x86_64), and gfortran -v gives:

Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/6.3.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc-multilib/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared --enable-threads=posix --enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --enable-multilib --disable-werror --enable-checking=release
Thread model: posix
gcc version 6.3.1 20170109 (GCC)
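---

For readers who don't want to follow the gist link, here is a minimal, self-contained sketch of the three approaches, assuming the dimension(3) variant of the vector type. The module and routine names (tensor_product, tp_loop, tp_matmul_left, tp_matmul_right) are illustrative and the details may differ slightly from the gist:

! Illustrative sketch only; names and loop structure are assumptions,
! not necessarily identical to the code in the gist.
module tensor_product
  implicit none
  integer, parameter :: dp = kind(0.d0)

  type :: vector
    ! single rank-1 component (the "dimension(3)" variant); the other
    ! test variant uses three separate %x, %y and %z reals instead
    real(dp) :: v(3)
  end type vector

contains

  ! 1) explicit DO CONCURRENT over i and j into a 4x4 work "matrix",
  !    then sum each component separately into the result 3-vector
  pure function tp_loop(NU, P, NV) result(tp)
    real(dp),     intent(in) :: NU(4), NV(4)
    type(vector), intent(in) :: P(4,4)
    type(vector)             :: tp
    type(vector)             :: work(4,4)
    integer                  :: i, j, k

    do concurrent (i = 1:4, j = 1:4)
      work(i,j)%v = NU(i) * P(i,j)%v * NV(j)
    end do
    do k = 1, 3
      tp%v(k) = sum(work%v(k))
    end do
  end function tp_loop

  ! 2) dot_product(matmul(NU,P),NV), once per component
  pure function tp_matmul_left(NU, P, NV) result(tp)
    real(dp),     intent(in) :: NU(4), NV(4)
    type(vector), intent(in) :: P(4,4)
    type(vector)             :: tp
    integer                  :: k

    do k = 1, 3
      tp%v(k) = dot_product(matmul(NU, P%v(k)), NV)
    end do
  end function tp_matmul_left

  ! 3) the same, with the matmul on the other side
  pure function tp_matmul_right(NU, P, NV) result(tp)
    real(dp),     intent(in) :: NU(4), NV(4)
    type(vector), intent(in) :: P(4,4)
    type(vector)             :: tp
    integer                  :: k

    do k = 1, 3
      tp%v(k) = dot_product(NU, matmul(P%v(k), NV))
    end do
  end function tp_matmul_right

end module tensor_product

A driver that fills NU, NV and each real component of P via the random_number intrinsic and times many calls to each of the three functions should reproduce the comparison described above.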