https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88767
Bug ID: 88767
Summary: 'unroll and jam' not optimizing some loops
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: helijia at gcc dot gnu.org
Target Milestone: ---

The test source is as follows:

__attribute__((noinline)) void calculate(const double* __restrict__ A,
                                         const double* __restrict__ B,
                                         double* __restrict__ C) {
  unsigned int l_m = 0;
  unsigned int l_n = 0;
  unsigned int l_k = 0;

  A = (const double*)__builtin_assume_aligned(A,16);
  B = (const double*)__builtin_assume_aligned(B,16);
  C = (double*)__builtin_assume_aligned(C,16);

  for ( l_n = 0; l_n < 9; l_n++ ) { // loop 1
    for ( l_m = 0; l_m < 10; l_m++ ) {
      C[(l_n*10)+l_m] = 0.0;
    } // loop 2
    for ( l_k = 0; l_k < 17; l_k++ ) { // loop 3
      for ( l_m = 0; l_m < 10; l_m++ ) { // loop 4
        C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
      }
    }
  }
}

#define SIZE 36

double A[SIZE][SIZE] __attribute__((aligned(16)));
double B[SIZE][SIZE] __attribute__((aligned(16)));
double C[SIZE][SIZE] __attribute__((aligned(16)));

int main() {
  long r, i, j;

  for (i=0; i < SIZE; i++) {
    for (j=0; j < SIZE; j++) {
      A[i][j] = 1.0;
      B[i][j] = 2.0;
      C[i][j] = 3.0;
    }
  }

  for (r=0; r < 1000000; r++) {
    calculate(&A[0][0],&B[0][0], &C[0][0]);
  }

  return 0;
}

First, I compiled the test case with the following command:

g++ unroll_jam_bug.cpp -O3 -funroll-loops -floop-unroll-and-jam -o unroll_jam_bug -fdump-tree-unrolljam-details

In the generated dump file unroll_jam_bug.cpp.143t.unrolljam, I found that no unroll and jam optimization was applied to the loops in the calculate function.

Second, I added the -fdump-tree-all option to the command line. The dumps show that the innermost loops (loop 3 and 4) are completely unrolled, because the pass_data_complete_unrolli pass considers the innermost loop small. Once the inner loop is fully expanded, the enclosing loop becomes large.
When the pass_loop_jam pass is about to unroll a loop, it checks whether unroll_factor * (number of instructions in the loop) > 200. If so, the optimization is abandoned; otherwise it proceeds.

Based on this second analysis, I tried disabling the cunrolli optimization, using the following command line:

g++ unroll_jam_bug.cpp -O3 -mcpu=power8 -fdisable-tree-cunrolli -floop-unroll-and-jam -o unroll_jam_bug -fdump-tree-unrolljam-details
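
The size check described above can be sketched as a small predicate. This is only an illustration of the arithmetic as the report states it, not GCC's actual code: the function name, the parameter names, and the constant name are hypothetical, and the value 200 is taken from the description above.

```cpp
#include <cassert>

// Hypothetical stand-in for the limit of 200 mentioned in the report.
constexpr unsigned long long kSizeLimit = 200;

// Returns true when unroll and jam would be abandoned under the check
// described in the report: unroll_factor * loop_insns > limit.
// The name and signature are assumptions made for this sketch.
bool exceeds_size_limit(unsigned unroll_factor, unsigned loop_insns,
                        unsigned long long limit = kSizeLimit) {
  assert(unroll_factor >= 1);  // unrolling by 0 is meaningless here
  // Widen before multiplying so large inputs cannot wrap around.
  return static_cast<unsigned long long>(unroll_factor) * loop_insns > limit;
}
```

With made-up sizes, exceeds_size_limit(2, 40) is false (80 <= 200), while exceeds_size_limit(2, 170) is true (340 > 200). Under this reading, keeping the inner loops rolled, as the command line above does by disabling cunrolli, is presumably what keeps the instruction count low enough for the pass to go ahead.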
With this command, the unroll and jam optimization is performed, but there still seems to be room for improvement.

Original code:

for ( l_n = 0; l_n < 9; l_n++ ) {
  for ( l_m = 0; l_m < 10; l_m++ ) {
    C[(l_n*10)+l_m] = 0.0;
  }
  for ( l_k = 0; l_k < 17; l_k++ ) {
    for ( l_m = 0; l_m < 10; l_m++ ) {
      C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
    }
  }
}

After unroll and jam pass:

for ( l_n = 0; l_n < 9; l_n++ ) {
  for ( l_m = 0; l_m < 10; l_m++ ) {
    C[(l_n*10)+l_m] = 0.0;
  }
  for ( l_k = 0; l_k < 17; l_k += 2 ) {
    for ( l_m = 0; l_m < 10; l_m++ ) {
      C[(l_n*10)+l_m] += A[(l_k*20)+l_m] * B[(l_n*20)+l_k];
      C[(l_n*10)+l_m] += A[(l_k*20 + 20)+l_m] * B[(l_n*20)+l_k + 1];
    }
  }
}
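
To make the transformation concrete, here is a hand-written sketch that unrolls the l_k loop by 2, jams the two copies of the inner l_m loop, and compares the result against the original loop nest. It is not the code GCC generates. Because the trip count 17 is odd, a literal reading of the simplified form above would also process l_k = 17 in its last step, so this sketch adds a remainder loop. The function names are hypothetical, and the alignment hints and __restrict__ qualifiers are omitted for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference loop nest, following the test case in the report.
void calculate_ref(const double* A, const double* B, double* C) {
  for (unsigned l_n = 0; l_n < 9; l_n++) {
    for (unsigned l_m = 0; l_m < 10; l_m++) C[l_n * 10 + l_m] = 0.0;
    for (unsigned l_k = 0; l_k < 17; l_k++)
      for (unsigned l_m = 0; l_m < 10; l_m++)
        C[l_n * 10 + l_m] += A[l_k * 20 + l_m] * B[l_n * 20 + l_k];
  }
}

// Hand-written unroll and jam: l_k unrolled by 2, inner bodies fused.
void calculate_jammed(const double* A, const double* B, double* C) {
  for (unsigned l_n = 0; l_n < 9; l_n++) {
    for (unsigned l_m = 0; l_m < 10; l_m++) C[l_n * 10 + l_m] = 0.0;
    unsigned l_k = 0;
    // Main loop: handles pairs (l_k, l_k + 1) while both are below 17.
    for (; l_k + 1 < 17; l_k += 2) {
      for (unsigned l_m = 0; l_m < 10; l_m++) {
        C[l_n * 10 + l_m] += A[l_k * 20 + l_m] * B[l_n * 20 + l_k];
        C[l_n * 10 + l_m] += A[(l_k + 1) * 20 + l_m] * B[l_n * 20 + l_k + 1];
      }
    }
    // Remainder loop: picks up the last iteration when the count is odd.
    for (; l_k < 17; l_k++)
      for (unsigned l_m = 0; l_m < 10; l_m++)
        C[l_n * 10 + l_m] += A[l_k * 20 + l_m] * B[l_n * 20 + l_k];
    assert(l_k == 17);  // every iteration of the original loop was covered
  }
}

// Runs both versions on the same input and reports whether they agree.
bool jammed_matches_reference() {
  const std::size_t n = 36 * 36;  // same storage size as in the test case
  std::vector<double> A(n), B(n), C1(n, 3.0), C2(n, 3.0);
  for (std::size_t i = 0; i < n; i++) {
    A[i] = 1.0 + (i % 7) * 0.5;   // small values that are exact in binary
    B[i] = 2.0 + (i % 5) * 0.25;
  }
  calculate_ref(A.data(), B.data(), C1.data());
  calculate_jammed(A.data(), B.data(), C2.data());
  return C1 == C2;
}
```

Calling jammed_matches_reference() from a small main is enough to check the two versions against each other. Both versions add the products into each C element in the same order, so the comparison can be exact rather than tolerance-based.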