https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121700

            Bug ID: 121700
           Summary: powerpc64le: Auto vectorization of modulo operation
                    does not happen
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: avinashd at gcc dot gnu.org
                CC: jskumari at gcc dot gnu.org, meissner at gcc dot gnu.org,
                    segher at gcc dot gnu.org
  Target Milestone: ---
            Target: powerpc*-*-*

Here is a simple snippet of a function that applies modulo by a constant:

void mod39(int *a) {
  for (int i=0; i<1024; i++) {
    a[i] %= 39;
  }
}

Compiling for powerpc64le with the following command
gcc -S -O2 test.c -fdump-tree-vect

produces the basic block below in gimple after the tree-vect pass:
<bb 3> [local count: 1063004408]:
# i_14 = PHI <i_11(5), 0(2)>
# ivtmp_13 = PHI <ivtmp_12(5), 1024(2)>
_1 = (long unsigned int) i_14;
_2 = _1 * 4;
_3 = a_9(D) + _2;
_4 = *_3;
_5 = _4 % 39;
*_3 = _5;
i_11 = i_14 + 1;
ivtmp_12 = ivtmp_13 - 1;
if (ivtmp_12 != 0)
  goto <bb 5>; [98.99%]
else
  goto <bb 4>; [1.01%]

which in turn generates this assembly:
.L2:
        lwzu 8,4(3)
        mulhw 10,8,6
        srawi 9,8,31
        add 10,10,8
        srawi 10,10,5
        subf 9,9,10
        mulli 9,9,39
        subf 9,9,8
        stw 9,0(3)
        bdnz .L2
        blr
        .long 0
        .byte 0,0,0,0,0,0,0,0
        .cfi_endproc

On other architectures, however, we see the loop being vectorized as follows:
  vect__4.6_26 = MEM <vector(4) int> [(int *)vectp_a.4_24];
  vect_patt_20.10_30 = vect__4.6_26 >> 31;
  vect_patt_17.7_27 = vect__4.6_26 h* { -770891565, -770891565, -770891565, -770891565 };
  vect_patt_18.8_28 = vect_patt_17.7_27 + vect__4.6_26;
  vect_patt_19.9_29 = vect_patt_18.8_28 >> 5;
  vect_patt_21.11_31 = vect_patt_19.9_29 - vect_patt_20.10_30;
  vect_patt_22.12_32 = vect_patt_21.11_31 * { 39, 39, 39, 39 };
  vect_patt_23.13_33 = vect__4.6_26 - vect_patt_22.12_32;
  MEM <vector(4) int> [(int *)vectp_a.14_34] = vect_patt_23.13_33;
  i_11 = i_14 + 1;
  ivtmp_12 = ivtmp_13 - 1;
  vectp_a.4_25 = vectp_a.4_24 + 16;
  vectp_a.14_35 = vectp_a.14_34 + 16;
  ivtmp_38 = ivtmp_37 + 1;
  if (ivtmp_38 < 256)
    goto <bb 5>; [98.99%]
  else
    goto <bb 4>; [1.01%]
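
The gimple above is the standard multiply-high trick for division by a
constant: the magic multiplier -770891565 and the shift count 5 come straight
from the dump. As a cross-check, below is a minimal scalar sketch of the same
sequence (the function name and test harness are mine, and it assumes
arithmetic right shift of signed integers, which GCC provides):

#include <assert.h>
#include <stdint.h>

/* Scalar form of the vectorized pattern:
     quotient  = ((mulhi (x, -770891565) + x) >> 5) - (x >> 31)
     remainder = x - quotient * 39                                */
static int mod39_magic (int x)
{
  int hi = (int) (((int64_t) x * -770891565LL) >> 32); /* x h* M */
  int q  = ((hi + x) >> 5) - (x >> 31);
  return x - q * 39;
}

int main (void)
{
  for (int x = -1000000; x <= 1000000; x++)
    assert (mod39_magic (x) == x % 39);
  return 0;
}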


The loop should be auto-vectorized for powerpc64le as well, which would result
in assembly similar to the following:
.L2:
        lxv 45,0(3)
        xxspltib 33,31
        addi 3,3,16
        vmulhsw 0,13,11
        vsraw 1,13,1
        vadduwm 0,0,13
        vsraw 0,0,12
        vsubuwm 0,0,1
        xxspltib 33,39
        stxv 33,-16(3)
        vmuluwm 1,0,1
        lxv 34,0(3)
        vsubuwm 0,2,1
        bdnz .L2
        blr
        .long 0
        .byte 0,0,0,0,0,0,0,0
        .cfi_endproc



This is almost 3x faster than the scalar code. But when the cost model
(scalar_single_iter_cost) for rs6000 is evaluated, the vectorizer considers
the scalar code to be faster.
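
As a sanity check (a guess at the right knobs, not a confirmed reproducer),
the cost-model decision can be bypassed with -fno-vect-cost-model, and
-fdump-tree-vect-details should show the scalar-vs-vector cost comparison
behind the rejection; since vmulhsw is an ISA 3.1 instruction, -mcpu=power10
would presumably be needed for the expected sequence:

gcc -S -O2 -mcpu=power10 -fno-vect-cost-model -fdump-tree-vect-details test.c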
