https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117874

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note m_mat_na.c and m_mat_nn.c are completely unrolled instead and not
vectorized by GCC 14 (nor trunk), still slower as reported (mult_su3_na/nn).
Trunk seems to unroll less:

m_mat_nn.c:73:30: optimized: loop with 3 iterations completely unrolled (header
execution count 268435456)

on trunk vs.

m_mat_nn.c:73:30: optimized: loop with 3 iterations completely unrolled (header
execution count 268435456)
m_mat_nn.c:73:14: optimized: loop with 2 iterations completely unrolled (header
execution count 89478486)

on branch.  In particular cunroll on GIMPLE does not unroll the outer loop on
trunk:

Loop 1 iterates 2 times.
Loop 1 iterates at most 2 times.
Loop 1 likely iterates at most 2 times.
size: 104-4, last_iteration: 104-4
  Loop size: 104
  Estimated size after unrolling: 300
Not unrolling loop 1: number of insns in the unrolled sequence reaches --param
max-completely-peeled-insns limit.
Not peeling: upper bound is known so can unroll completely

vs branch:

Loop 1 iterates 2 times.
Loop 1 iterates at most 2 times.
Loop 1 likely iterates at most 2 times.
size: 104-4, last_iteration: 104-4
  Loop size: 104
  Estimated size after unrolling: 200

That's the r15-919-gef27b91b62c3aa change, I think.  The heuristic, while
careful, doesn't accurately remember what's "innermost" in this case,
though it's still correct that the body isn't simplified by 1/3: here
cunroll has 306 stmts while optimized has 276 (FMA disabled),
so the reduction is purely CSE.
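
For reference, the two "Estimated size after unrolling" figures can be
reproduced with a small sketch of the size arithmetic.  This is an
illustration only, not the code in tree-ssa-loop-ivcanon.cc: the struct,
the function names and the apply_two_thirds switch are hypothetical, and
the limit of 200 in the comments assumes the default value of --param
max-completely-peeled-insns.  The dumps are consistent with a 2/3
discount being applied on branch and not on trunk for this loop.

```c
#include <assert.h>

/* Hypothetical stand-in for the summary printed in the dump as
   "size: 104-4, last_iteration: 104-4"; not GCC's actual type.  */
struct loop_size_sketch
{
  long overall;          /* insns in one iteration (104) */
  long eliminated;       /* insns expected to vanish when unrolled (4) */
  long last_overall;     /* same, for the last iteration (104) */
  long last_eliminated;  /* same, for the last iteration (4) */
};

/* NUNROLL is the number of latch executions ("Loop 1 iterates 2 times").
   APPLY_TWO_THIRDS models the assumption that later passes shrink the
   unrolled body by about one third.  */
long
estimate_unrolled_size (const struct loop_size_sketch *s, long nunroll,
                        int apply_two_thirds)
{
  assert (nunroll >= 0);
  long insns = nunroll * (s->overall - s->eliminated);
  insns += s->last_overall - s->last_eliminated;
  if (apply_two_thirds)
    insns = insns * 2 / 3;
  return insns > 0 ? insns : 1;
}

/* Unrolling is rejected only when the estimate is strictly above the
   limit (assumed here), so 200 passes a limit of 200 and 300 does not.  */
int
exceeds_limit (long estimate, long limit)
{
  return estimate > limit;
}
```

With { 104, 4, 104, 4 } and nunroll = 2 this gives 2 * 100 + 100 = 300
without the discount (trunk) and 300 * 2 / 3 = 200 with it (branch).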


Testcase:

typedef struct {
    double real;
    double imag;
} complex;
typedef struct { complex e[3][3]; } su3_matrix;
void mult_su3_nn( su3_matrix *a, su3_matrix *b, su3_matrix *c )
{
  int i,j;
  double t,ar,ai,br,bi,cr,ci;
  for(i=0;i<3;i++)for(j=0;j<3;j++){

      ar=a->e[i][0].real; ai=a->e[i][0].imag;
      br=b->e[0][j].real; bi=b->e[0][j].imag; 
      cr=ar*br; t=ai*bi; cr -= t;
      ci=ar*bi; t=ai*br; ci += t;

      ar=a->e[i][1].real; ai=a->e[i][1].imag; 
      br=b->e[1][j].real; bi=b->e[1][j].imag;
      t=ar*br; cr += t; t=ai*bi; cr -= t; 
      t=ar*bi; ci += t; t=ai*br; ci += t;

      ar=a->e[i][2].real; ai=a->e[i][2].imag;
      br=b->e[2][j].real; bi=b->e[2][j].imag;
      t=ar*br; cr += t; t=ai*bi; cr -= t;
      t=ar*bi; ci += t; t=ai*br; ci += t;

      c->e[i][j].real=cr;
      c->e[i][j].imag=ci;
  }
}
