https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Also, I wonder why vect_recog_popcount_pattern handles only popcount; can't it
handle clz/ctz as well?
I mean for
void
foo (long long *p, long long *q)
{
  for (int i = 0; i < 2048; ++i)
    p[i] = __builtin_popcountll (q[i]);
}

void
bar (long long *p, long long *q)
{
  for (int i = 0; i < 2048; ++i)
    p[i] = __builtin_clzll (q[i]);
}
with -O3 -mavx512{bw,cd,vl,dq,bitalg,vpopcntdq} we get, in the *.optimized
dump, a nice inner loop for the popcount case:
  vect__4.7_40 = MEM <vector(8) long long int> [(long long int *)q_12(D) + ivtmp.25_1 * 1];
  vect_patt_24.8_41 = .POPCOUNT (vect__4.7_40);
  MEM <vector(8) long long int> [(long long int *)p_13(D) + ivtmp.25_1 * 1] = vect_patt_24.8_41;
but in the other loop (bar) we get:
  vect__4.36_39 = MEM <vector(8) long long int> [(long long int *)q_12(D) + ivtmp.56_1 * 1];
  vect__4.37_41 = MEM <vector(8) long long int> [(long long int *)q_12(D) + 64B + ivtmp.56_1 * 1];
  vect__5.38_42 = VIEW_CONVERT_EXPR<vector(8) long long unsigned int>(vect__4.36_39);
  vect__5.38_43 = VIEW_CONVERT_EXPR<vector(8) long long unsigned int>(vect__4.37_41);
  _44 = .CLZ (vect__5.38_42);
  _45 = .CLZ (vect__5.38_43);
  vect__6.39_46 = VEC_PACK_TRUNC_EXPR <_44, _45>;
  vect__8.40_47 = [vec_unpack_lo_expr] vect__6.39_46;
  vect__8.40_48 = [vec_unpack_hi_expr] vect__6.39_46;
  MEM <vector(8) long long int> [(long long int *)p_13(D) + ivtmp.56_1 * 1] = vect__8.40_47;
  MEM <vector(8) long long int> [(long long int *)p_13(D) + 64B + ivtmp.56_1 * 1] = vect__8.40_48;
So we have to handle twice as many vectors regardless of unrolling, perform the
V8DI->V8DI vector clz twice, then pack the two results and immediately unpack
them again.
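
For reference, here is a sketch of what one iteration of each loop does at the
scalar level, with the implicit conversions written out. This is only one way
to read the dump, not a statement about the vectorizer's internals. Both
__builtin_clzll and __builtin_popcountll take unsigned long long and return
int, so the store to p[i] involves an int -> long long widening. The unsigned
VIEW_CONVERT_EXPR, the VEC_PACK_TRUNC_EXPR and the vec_unpack_{lo,hi}_expr in
the clz loop look like those conversions carried over into the vector code.
The names bar_element, foo_element and self_check are hypothetical and exist
only for this illustration.

```c
#include <assert.h>

/* Hypothetical helper: one iteration of bar's loop body, with the
   conversions that the C source leaves implicit spelled out.  */
static long long
bar_element (long long x)
{
  unsigned long long u = (unsigned long long) x; /* argument conversion */
  int narrow = __builtin_clzll (u);              /* builtin returns int */
  return (long long) narrow;                     /* widened for the store */
}

/* Hypothetical helper: the same for foo's loop body.  */
static long long
foo_element (long long x)
{
  int narrow = __builtin_popcountll ((unsigned long long) x);
  return (long long) narrow;
}

/* Check both helpers on single-bit inputs.  Zero is avoided because
   __builtin_clzll (0) is undefined.  */
static void
self_check (void)
{
  for (int i = 0; i < 63; ++i)
    {
      long long x = 1LL << i;
      assert (bar_element (x) == 63 - i);
      assert (foo_element (x) == 1);
    }
}
```

Since the result of either builtin on a 64-bit input always fits in the low
bits of a 64-bit lane, the narrowing and re-widening carries no information,
which is why the pack/unpack pair in the clz loop is pure overhead.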
