https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117251

            Bug ID: 117251
           Summary: SHA3 code for PowerPC has a major slow down
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: major
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: meissner at gcc dot gnu.org
  Target Milestone: ---

Created attachment 59405
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59405&action=edit
Multibuff.c test

The SHA3 functions compiled for PowerPC are slower with GCC 15 and GCC 14
than with GCC 13 and GCC 12 when compiled for power10, due to excessive
amounts of spilling.

The main function for multibuf.c has 3,747 lines, all of which use vector
unsigned long long.  There are 696 vector shifts (all shifts are constant),
1,824 vector xors and 600 vector andcs.
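
For readers without the attachment, the operation mix can be sketched roughly
as below.  This is only an illustration: it uses GCC's generic vector
extension instead of the Altivec intrinsics so that it builds on any target,
and the names v2du, rotl_lanes and column_parity are made up here, not taken
from multibuf.c.  The five-way xor is modeled on the column parity of the
Keccak permutation that underlies SHA3.

```c
#include <stdint.h>

/* Two 64-bit lanes, standing in for "vector unsigned long long". */
typedef uint64_t v2du __attribute__((vector_size(16)));

/* Rotate each 64-bit lane left by a constant n, with 0 < n < 64. */
static inline v2du rotl_lanes(v2du x, uint64_t n)
{
    v2du left  = { n, n };
    v2du right = { 64 - n, 64 - n };
    return (x << left) | (x >> right);
}

/* Xor five vectors together, as a Keccak-style column parity would. */
static inline v2du column_parity(const v2du a[5])
{
    return a[0] ^ a[1] ^ a[2] ^ a[3] ^ a[4];
}
```

Each lane is independent, so one vector instruction advances two hash states
at once; that is presumably why almost every statement in the test is a
vector xor, andc or constant shift.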

The timing for these runs is the following:

Trunk  (sources checked out October 5th):    6.15 seconds
GCC 14 (sources checked out October 21st):   6.28 seconds
GCC 13 (sources checked out October 21st):   5.57 seconds
GCC 12 (sources checked out October 21st):   5.61 seconds
GCC 11 (sources checked out October 21st):   9.56 seconds
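
To put the timings on a common scale, the relative slowdown against a chosen
baseline can be computed as in this small sketch (percent_slower is a
hypothetical helper, not part of the test case):

```c
/* Percentage by which "seconds" exceeds "baseline". */
static inline double percent_slower(double seconds, double baseline)
{
    return (seconds / baseline - 1.0) * 100.0;
}
```

With GCC 13 (5.57 seconds) as the baseline, trunk comes out roughly 10%
slower, GCC 14 roughly 13% slower, and GCC 11 roughly 72% slower.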

Looking at it, the main thing that stands out as the reason for the spilling
and the register moves is the support in gcc/config/rs6000/fusion.md (generated
by gcc/config/rs6000/genfusion.pl) that tries to fuse a vec_andc feeding into
a vec_xor, and a vec_xor feeding into another vec_xor.

On power10, a special kind of fusion can happen when a VANDC or VXOR
instruction is immediately followed by a VXOR instruction and the result of
the VANDC/VXOR feeds into that second VXOR instruction.
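
At the source level, the producer/consumer shape that these patterns look for
is roughly the following.  This is a sketch using GCC's generic vector
extension; the function name chi_lane is hypothetical, and whether a given
pair is actually fused depends on what the compiler emits.

```c
#include <stdint.h>

/* Two 64-bit lanes, standing in for "vector unsigned long long". */
typedef uint64_t v2du __attribute__((vector_size(16)));

/* andc feeding xor: computes a ^ (c & ~b) in each lane. */
static inline v2du chi_lane(v2du a, v2du b, v2du c)
{
    v2du t = c & ~b;   /* producer: the andc                     */
    return a ^ t;      /* consumer: the xor that uses the result */
}
```

The temporary t has a single use in the very next operation, which is what
makes the pair a candidate for being emitted back to back.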

While the Power10 has 64 vector registers (the VSX registers, whose logical
instructions use the XXL prefix), the fusion only works with the older Altivec
instructions (which use the V prefix).  The Altivec instruction set can only
address 32 vector registers (which are overlaid on VSX vector registers 32-63).
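
The overlay can be written down as a tiny mapping.  The helper below is
purely illustrative (vsr_to_vr is not a GCC function); it only encodes the
numbering described above.

```c
/* Map a VSX register number (0..63) to the Altivec register number
   (0..31) that shares its storage.  Returns -1 for VSX registers 0..31,
   which V-prefixed instructions cannot name, and for invalid numbers. */
static inline int vsr_to_vr(int vsr)
{
    if (vsr < 32 || vsr > 63)
        return -1;
    return vsr - 32;
}
```

So a value that must be an operand of a V-prefixed instruction can live in
only half of the vector register file.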

Because the combiner patterns fuse_vandc_vxor and fuse_vxor_vxor do this
fusion, the register allocator sees more register pressure on the traditional
Altivec registers instead of being able to spread it over the VSX registers.

In addition, since there are vector shifts, these shifts only work on the
traditional Altivec registers, which adds to the Altivec register pressure.

Finally, loading the vector constants for the shifts requires Altivec
registers (using XXSPLTIB and VEXTSB2D to form the constant).  But this doesn't
add to the register pressure, since these constants are all used in the VRLD
vector shift instruction.
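
As a rough scalar model of how that two-instruction sequence forms a constant
(an illustration only; splat_then_extend is a made-up name and the real
instructions operate on a 128-bit register):

```c
#include <stdint.h>
#include <string.h>

/* Model of XXSPLTIB followed by VEXTSB2D: copy an 8-bit immediate into
   all 16 bytes, then sign-extend one byte of each 64-bit half to the
   full 64 bits.  Every byte is identical after the splat, so which byte
   of each half is read does not matter in this model. */
static inline void splat_then_extend(uint8_t imm8, int64_t lanes[2])
{
    uint8_t bytes[16];
    memset(bytes, imm8, sizeof bytes);      /* XXSPLTIB */
    for (int i = 0; i < 2; i++)
        lanes[i] = (int8_t)bytes[8 * i];    /* VEXTSB2D */
}
```

For example, an immediate of 21 yields the value 21 in both 64-bit lanes,
ready to be used as a per-lane shift count.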

Here are some summaries for the various compilers:

                                        Trunk   GCC14   GCC13   GCC12   GCC11
                                        -----   -----   -----   -----   -----
Fuse VANDC -> VXOR                        600     600     600     600     600
Fuse VXOR -> VXOR                         240     240     120     120     120

Spill vector to stack                     364     364     172     184     110
Load spilled vector from stack            962     962     713     723     166
Vector moves                              100     100      70      72   3,055

Vector shift right                        696     696     696     696     696
XXLANDC or VANDC                          600     600     600     600     600
XXLXOR or VXOR                          1,824   1,824   1,824   1,824   1,825

XXSPLTIB and VEXTSB2D to load constants    24      24      24      24      24

This means that current trunk and GCC 14 have more vector spills and loads than
GCC 13 and GCC 12.  In addition, they have some more vector moves.

Current trunk and GCC 14-12 have more vector spills than GCC 11, but GCC 11 has
many more vector moves than the other compilers.  Thus even though it has far
fewer spills, the vector moves are why GCC 11 has the slowest results.
