https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121700
Bug ID: 121700
Summary: powerpc64le: Auto vectorization of modulo operation does not happen
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: avinashd at gcc dot gnu.org
CC: jskumari at gcc dot gnu.org, meissner at gcc dot gnu.org, segher at gcc dot gnu.org
Target Milestone: ---
Target: powerpc*-*-*

Here is a simple function that applies a modulo by a constant:

void mod39(int *a) {
  for (int i=0; i<1024; i++) {
    a[i] %= 39;
  }
}

On powerpc64le, compiling with the following command

gcc -S -O2 test.c -fdump-tree-vect

produces the basic block below in the GIMPLE dump after the tree-vect pass:

<bb 3> [local count: 1063004408]:
# i_14 = PHI <i_11(5), 0(2)>
# ivtmp_13 = PHI <ivtmp_12(5), 1024(2)>
_1 = (long unsigned int) i_14;
_2 = _1 * 4;
_3 = a_9(D) + _2;
_4 = *_3;
_5 = _4 % 39;
*_3 = _5;
i_11 = i_14 + 1;
ivtmp_12 = ivtmp_13 - 1;
if (ivtmp_12 != 0)
  goto <bb 5>; [98.99%]
else
  goto <bb 4>; [1.01%]

which in turn generates this assembly:

.L2:
        lwzu 8,4(3)
        mulhw 10,8,6
        srawi 9,8,31
        add 10,10,8
        srawi 10,10,5
        subf 9,9,10
        mulli 9,9,39
        subf 9,9,8
        stw 9,0(3)
        bdnz .L2
        blr
        .long 0
        .byte 0,0,0,0,0,0,0,0
        .cfi_endproc

On other architectures, the loop is vectorized as follows:

vect__4.6_26 = MEM <vector(4) int> [(int *)vectp_a.4_24];
vect_patt_20.10_30 = vect__4.6_26 >> 31;
vect_patt_17.7_27 = vect__4.6_26 h* { -770891565, -770891565, -770891565, -770891565 };
vect_patt_18.8_28 = vect_patt_17.7_27 + vect__4.6_26;
vect_patt_19.9_29 = vect_patt_18.8_28 >> 5;
vect_patt_21.11_31 = vect_patt_19.9_29 - vect_patt_20.10_30;
vect_patt_22.12_32 = vect_patt_21.11_31 * { 39, 39, 39, 39 };
vect_patt_23.13_33 = vect__4.6_26 - vect_patt_22.12_32;
MEM <vector(4) int> [(int *)vectp_a.14_34] = vect_patt_23.13_33;
i_11 = i_14 + 1;
ivtmp_12 = ivtmp_13 - 1;
vectp_a.4_25 = vectp_a.4_24 + 16;
vectp_a.14_35 = vectp_a.14_34 + 16;
ivtmp_38 = ivtmp_37 + 1;
if (ivtmp_38 < 256)
  goto <bb 5>; [98.99%]
else
  goto <bb 4>; [1.01%]

The loop should be auto-vectorized for powerpc64le as well, which would result in assembly similar to the following:

.L2:
        lxv 45,0(3)
        xxspltib 33,31
        addi 3,3,16
        vmulhsw 0,13,11
        vsraw 1,13,1
        vadduwm 0,0,13
        vsraw 0,0,12
        vsubuwm 0,0,1
        xxspltib 33,39
        stxv 33,-16(3)
        vmuluwm 1,0,1
        lxv 34,0(3)
        vsubuwm 0,2,1
        bdnz .L2
        blr
        .long 0
        .byte 0,0,0,0,0,0,0,0
        .cfi_endproc

This is almost 3x faster than the scalar code. However, when evaluating the cost model (scalar_single_iter_cost) for rs6000, the vectorizer considers the scalar code to be faster.
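
For reference, the vect_patt_* sequence in the vectorized GIMPLE above can be written out per element in plain C. This is only a sketch of what each lane computes, not code taken from GCC: the names mulhi_s32, mod39_mulhi and count_mismatches are made up for illustration, and it assumes that >> on a negative signed value is an arithmetic shift, which holds for GCC. The constant -770891565 is the one shown in the dump.

```c
#include <assert.h>
#include <stdint.h>

/* Signed high half of a 32x32-bit multiply: the "h*" operation in the
   GIMPLE dump. Hypothetical helper name. */
static int32_t mulhi_s32(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * (int64_t)y) >> 32);
}

/* x % 39 without a divide instruction, following the vect_patt_* steps.
   Assumes >> on negative signed values is an arithmetic shift (GCC). */
static int32_t mod39_mulhi(int32_t x)
{
    int32_t sign = x >> 31;                   /* vect_patt_20: -1 if x < 0, else 0 */
    int32_t hi   = mulhi_s32(x, -770891565);  /* vect_patt_17: x h* magic constant */
    int32_t sum  = hi + x;                    /* vect_patt_18 */
    int32_t shr  = sum >> 5;                  /* vect_patt_19 */
    int32_t quot = shr - sign;                /* vect_patt_21: x / 39, truncated */
    return x - quot * 39;                     /* vect_patt_22 and vect_patt_23 */
}

/* Hypothetical checker: counts inputs in [lo, hi] where the sequence
   disagrees with the % operator. */
static int count_mismatches(int32_t lo, int32_t hi)
{
    int bad = 0;
    for (int64_t v = lo; v <= hi; v++)
        if (mod39_mulhi((int32_t)v) != (int32_t)v % 39)
            bad++;
    return bad;
}
```

A check such as assert(count_mismatches(-100000, 100000) == 0) can be used to compare the sequence against the % operator. The scalar powerpc64le assembly above performs the same steps with mulhw, srawi, add, subf and mulli, so the vectorized form needs only the element-wise counterparts of those operations.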