https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117251
Bug ID: 117251
Summary: SHA3 code for PowerPC has a major slow down
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: major
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: meissner at gcc dot gnu.org
Target Milestone: ---

Created attachment 59405
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59405&action=edit
Multibuff.c test

The sha3 functions compiled for PowerPC are slower with GCC 15 and GCC 14 than with GCC 13 and GCC 12 when compiled for power10, due to excessive amounts of spilling.

The main function for multibuf.c has 3,747 lines, all of which use vector unsigned long long. There are 696 vector shifts (all shifts are constant), 1,824 vector xor's and 600 vector andc's.

The timing for these runs is the following:

    Trunk  (sources checked out October 5th):   6.15 seconds
    GCC 14 (sources checked out October 21st):  6.28 seconds
    GCC 13 (sources checked out October 21st):  5.57 seconds
    GCC 12 (sources checked out October 21st):  5.61 seconds
    GCC 11 (sources checked out October 21st):  9.56 seconds

In looking at it, the main thing that stands out as the reason for either spilling or moving variables is the support in gcc/rs6000/fusion.md (generated by gcc/rs6000/genfusion.pl) that tries to fuse the vec_andc feeding into vec_xor, and other vec_xor's feeding into vec_xor.

On the powerpc for power10, there is a special fusion mode that applies when a VANDC or VXOR instruction is adjacent to a VXOR instruction and the VANDC/VXOR feeds into the second VXOR instruction.

While the Power10 has 64 vector registers (logical operations on the full set use instructions with the XXL prefix), the fusion only works with the older Altivec instruction set (which uses the V prefix). The Altivec instruction set only has 32 vector registers (which are overlaid over the VSX vector registers 32-63).
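For illustration, the source-level shapes that lead to these instruction pairs can be sketched in C as follows. This is only a sketch, not code taken from the attached test: the type name v2du and the functions rotl_v2du, chi_v2du and xor3_v2du are hypothetical, and it uses the GCC/Clang generic vector extension rather than <altivec.h> so that it compiles on any host. Whether a given compiler actually emits VANDC/VXOR (rather than XXLANDC/XXLXOR) for it depends on the target options and the compiler version.

```c
#include <assert.h>
#include <stdint.h>

/* Two 64-bit lanes per vector (hypothetical name), using the generic
   vector extension instead of "vector unsigned long long".  */
typedef uint64_t v2du __attribute__ ((vector_size (16)));

/* Rotate each lane left by a constant N, 0 < N < 64.  On power10 this
   is the kind of operation a VRLD instruction can implement.  */
static inline v2du
rotl_v2du (v2du x, unsigned n)
{
  assert (n > 0 && n < 64);
  v2du left = { n, n };
  v2du right = { 64 - n, 64 - n };
  return (x << left) | (x >> right);
}

/* Keccak-style chi step: a ^ (~b & c).  The and-with-complement
   result feeds directly into the xor, which is the andc -> xor shape
   the fusion pattern looks for.  */
static inline v2du
chi_v2du (v2du a, v2du b, v2du c)
{
  return a ^ (~b & c);
}

/* Three-input xor: the first xor feeds the second, which is the
   xor -> xor shape.  */
static inline v2du
xor3_v2du (v2du a, v2du b, v2du c)
{
  return (a ^ b) ^ c;
}
```

Compiling a file like this with -O2 -mcpu=power10 -S and looking for vandc/vxor versus xxlandc/xxlxor in the assembly is one way to see which register class the logical operations ended up in.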
Having the combiner patterns fuse_vandc_vxor and fuse_vxor_vxor do this fusion means that the register allocator sees more register pressure on the traditional Altivec registers instead of the VSX registers.

In addition, the vector shifts only work on the traditional Altivec registers, which adds to the Altivec register pressure.

Finally, loading up the vector constants for the shifts requires Altivec registers (using XXSPLTIB and VEXTSB2D to form the constant). But this doesn't add to the register pressure, since these constants are all used in the VRLD vector shift instruction.

Here are some summaries for the various compilers:

                                   Trunk   GCC14   GCC13   GCC12   GCC11
                                   -----   -----   -----   -----   -----
    Fuse VANDC -> VXOR               600     600     600     600     600
    Fuse VXOR -> VXOR                240     240     120     120     120
    Spill vector to stack            364     364     172     184     110
    Load spilled vector from stack   962     962     713     723     166
    Vector moves                     100     100      70      72   3,055
    Vector shift right               696     696     696     696     696
    XXLANDC or VANDC                 600     600     600     600     600
    XXLXOR or VXOR                 1,824   1,824   1,824   1,824   1,825
    XXSPLTIB and VEXTSB2D
      to load constants               24      24      24      24      24

This means that current trunk and GCC 14 have more vector spills and loads than GCC 13 and GCC 12. In addition, they have some more vector moves.

Current trunk and GCC 14-12 have more vector spills than GCC 11, but GCC 11 has many more vector moves than the other compilers. Thus even though it has far fewer spills, the vector moves are why GCC 11 has the slowest results.
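To make the comparison above easier to follow, here is a small C sketch that adds up the spill, reload and move counts per compiler. The numbers are copied from the summary table and the timings from the list earlier in this report; the struct and function names are hypothetical. The counts are static instruction counts in the generated code, so this is only a rough indicator and not a measurement of what executes at run time.

```c
#include <stdio.h>

/* One row per compiler; counts and timings copied from this report.  */
struct counts
{
  const char *compiler;
  int spills;     /* "Spill vector to stack" */
  int reloads;    /* "Load spilled vector from stack" */
  int moves;      /* "Vector moves" */
  double seconds; /* measured run time */
};

static const struct counts table[] = {
  { "Trunk",  364, 962,  100, 6.15 },
  { "GCC 14", 364, 962,  100, 6.28 },
  { "GCC 13", 172, 713,   70, 5.57 },
  { "GCC 12", 184, 723,   72, 5.61 },
  { "GCC 11", 110, 166, 3055, 9.56 },
};

/* Stores to and loads from the stack caused by spilling.  */
static int
spill_traffic (const struct counts *c)
{
  return c->spills + c->reloads;
}

/* Spill traffic plus register-to-register vector moves.  */
static int
overhead_insns (const struct counts *c)
{
  return spill_traffic (c) + c->moves;
}

/* Print one line per compiler.  */
static void
print_summary (void)
{
  for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
    printf ("%-6s  spill+reload %5d  with moves %5d  %.2f s\n",
            table[i].compiler, spill_traffic (&table[i]),
            overhead_insns (&table[i]), table[i].seconds);
}
```

Calling print_summary () from main gives 1,426 overhead instructions for trunk and GCC 14, 955 for GCC 13, 979 for GCC 12 and 3,331 for GCC 11, which lines up with the timing order apart from the small difference between trunk and GCC 14.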