https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011
--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Also, I wonder why vect_recog_popcount_pattern handles only popcount, can't it
handle clz/ctz as well?
I mean for

void
foo (long long *p, long long *q)
{
  for (int i = 0; i < 2048; ++i)
    p[i] = __builtin_popcountll (q[i]);
}

void
bar (long long *p, long long *q)
{
  for (int i = 0; i < 2048; ++i)
    p[i] = __builtin_clzll (q[i]);
}

with -O3 -mavx512{bw,cd,vl,dq,bitalg,vpopcntdq} we have in *.optimized in the
inner loop of foo the nice:

  vect__4.7_40 = MEM <vector(8) long long int> [(long long int *)q_12(D) + ivtmp.25_1 * 1];
  vect_patt_24.8_41 = .POPCOUNT (vect__4.7_40);
  MEM <vector(8) long long int> [(long long int *)p_13(D) + ivtmp.25_1 * 1] = vect_patt_24.8_41;

but in the other loop:

  vect__4.36_39 = MEM <vector(8) long long int> [(long long int *)q_12(D) + ivtmp.56_1 * 1];
  vect__4.37_41 = MEM <vector(8) long long int> [(long long int *)q_12(D) + 64B + ivtmp.56_1 * 1];
  vect__5.38_42 = VIEW_CONVERT_EXPR<vector(8) long long unsigned int>(vect__4.36_39);
  vect__5.38_43 = VIEW_CONVERT_EXPR<vector(8) long long unsigned int>(vect__4.37_41);
  _44 = .CLZ (vect__5.38_42);
  _45 = .CLZ (vect__5.38_43);
  vect__6.39_46 = VEC_PACK_TRUNC_EXPR <_44, _45>;
  vect__8.40_47 = [vec_unpack_lo_expr] vect__6.39_46;
  vect__8.40_48 = [vec_unpack_hi_expr] vect__6.39_46;
  MEM <vector(8) long long int> [(long long int *)p_13(D) + ivtmp.56_1 * 1] = vect__8.40_47;
  MEM <vector(8) long long int> [(long long int *)p_13(D) + 64B + ivtmp.56_1 * 1] = vect__8.40_48;

So, we need to handle twice as many vectors regardless of unrolling, perform
the vector V8DI->V8DI clz twice, then pack the results and immediately unpack
them again.
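For reference, a scalar sketch of where the pack/unpack pair appears to come from. This is my reading of the dumps above, not something the comment states outright: both builtins return int, so each store to a long long element carries an implicit int -> long long widening, and in the clz loop that shows up as a narrowing pack followed by a widening unpack. The helper names popcount_loop, clz_loop and self_check are hypothetical; only __builtin_popcountll and __builtin_clzll come from the testcase.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: same computation as foo above, with the
   int-typed builtin result and the widening store made explicit.  */
void
popcount_loop (long long *p, const long long *q, size_t n)
{
  for (size_t i = 0; i < n; ++i)
    {
      int narrow = __builtin_popcountll ((unsigned long long) q[i]);
      p[i] = (long long) narrow;	/* int -> long long widening */
    }
}

/* Hypothetical helper: same computation as bar above.
   __builtin_clzll is undefined for a zero argument, so callers
   are assumed to pass non-zero elements only.  */
void
clz_loop (long long *p, const long long *q, size_t n)
{
  for (size_t i = 0; i < n; ++i)
    {
      int narrow = __builtin_clzll ((unsigned long long) q[i]);
      p[i] = (long long) narrow;	/* int -> long long widening */
    }
}

/* Small self-check on a few non-zero inputs; returns 1 on success.  */
int
self_check (void)
{
  long long q[4] = { 1, 2, 0xff, -1 };
  long long p[4];

  popcount_loop (p, q, 4);
  assert (p[0] == 1 && p[1] == 1 && p[2] == 8 && p[3] == 64);

  clz_loop (p, q, 4);
  assert (p[0] == 63 && p[1] == 62 && p[2] == 56 && p[3] == 0);
  return 1;
}
```

Since the results always fit in the range 0..64, computing them directly in the 64-bit element type gives the same values, which is what the .POPCOUNT form in the first dump does.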