Embench is a benchmark suite for embedded devices. One of its projects, matmult-int, has a function Multiply, which is a matrix multiplication for 20 x 20 matrices. The device is an ATSAME70Q21B, which is a Cortex-M7. The compiler is an Arm branch based on GCC version 13. We are compiling with -O3, which has the loop-interchange pass on by default.
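For readers who do not have the benchmark at hand, here is a minimal sketch of the kind of loop nest involved and what an interchange does to it. It is not copied from the Embench source: the names multiply_ijk, multiply_ikj and orders_agree are hypothetical, and the i-k-j order shown is only one possible interchange, not necessarily the one GCC picks.

```c
#include <string.h>

#define N 20 /* matrix dimension from the post */

/* Textbook i-j-k order: the inner loop walks b down a column (stride N ints). */
void multiply_ijk(int a[N][N], int b[N][N], int res[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            int sum = 0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            res[i][j] = sum;
        }
    }
}

/* Interchanged i-k-j order: the inner loop walks b and res along a row
 * (stride 1), but res[i][j] is now loaded and stored on every iteration. */
void multiply_ikj(int a[N][N], int b[N][N], int res[N][N])
{
    memset(res, 0, sizeof(int) * N * N);
    for (int i = 0; i < N; i++) {
        for (int k = 0; k < N; k++) {
            int aik = a[i][k];
            for (int j = 0; j < N; j++)
                res[i][j] += aik * b[k][j];
        }
    }
}

/* Returns 1 if both loop orders produce the same product, 0 otherwise.
 * The input values are arbitrary and small enough to avoid int overflow. */
int orders_agree(void)
{
    static int a[N][N], b[N][N], r1[N][N], r2[N][N];
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            a[i][j] = i + j;
            b[i][j] = i - j;
        }
    }
    multiply_ijk(a, b, r1);
    multiply_ikj(a, b, r2);
    return memcmp(r1, r2, sizeof r1) == 0;
}
```

Both orders compute the same result, so any difference between them shows up only in the memory access pattern and the generated code, which is what the timing comparison is measuring.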
When we compile with -fno-loop-interchange, we get the full 22% degradation back plus a 5% speedup. When we do the loop interchange on the one loop nest that gets interchanged, it is slightly (0.7%) faster. Has anyone else seen a large degradation as a result of loop interchange? Thanks