History: This is version 2 of the patch.  In the original patch, all 44 fusion opportunities were lumped together in one patch.  Outside of fusion.md, these changes are fairly small, in that they add one alternative to each of the fusion patterns to provide XXEVAL support.  Fusion.md is a generated file (created by genfusion.pl) that produces all of the fusion combinations.  Because of these automated changes, fusion.md had 265 lines deleted and 397 lines added.
In version 2 of the patch, I broke the original patch into 45 separate patches.  The first patch adds the basic support to genfusion.pl, predicates.md, rs6000.h, and rs6000.md, along with the first fusion case (vector 'AND' fusing into vector 'AND').  The next 43 patches each add one more fusion case, and the last patch adds the two test cases.

The multibuff.c benchmark attached to PR target/117251, which implements SHA3 and is compiled for Power10 PowerPC, has a slowdown on the current trunk and GCC 14 compared to GCC 11 through GCC 13, due to excessive amounts of spilling.  The main function in the multibuff.c file has 3,747 lines, all of which use vector unsigned long long.  There are 696 vector rotates (all with constant rotate counts), 1,824 vector xor's, and 600 vector andc's.

In looking at it, the main thing that stands out as the reason for the spilling and moving of variables is the support in fusion.md (generated by genfusion.pl) that tries to fuse a vec_andc feeding into a vec_xor, and other vec_xor's feeding into a vec_xor.  On Power10 there is a special fusion that happens if a VANDC or VXOR instruction is adjacent to a VXOR instruction and the VANDC/VXOR feeds into the second VXOR instruction.  While the Power10 has 64 vector registers (whose logical operations use the XXL-prefixed instructions), this fusion only works with the older Altivec instructions (which use the V prefix).  The Altivec instructions can only access 32 vector registers (which are overlaid on VSX vector registers 32-63).

Because the combiner patterns fuse_vandc_vxor and fuse_vxor_vxor do this fusion, the register allocator sees more register pressure on the traditional Altivec registers instead of the VSX registers.  In addition, the vector rotates only work on the traditional Altivec registers, which adds to the Altivec register pressure.  Finally, in addition to doing the explicit xor, andc, and rotate operations in Altivec registers, we also have to load vector constants for the rotate amounts, and those constants are likewise allocated to Altivec registers.

Current trunk and GCC 12-14 have more vector spills than GCC 11, but GCC 11 has many more vector moves than the later compilers.  Thus even though it has far fewer spills, the vector moves are why GCC 11 has the slowest results.

Power10 added an instruction (XXEVAL) that does provide fusion between VSX vectors, including ANDC->XOR and XOR->XOR fusion.  The latency of XXEVAL is slightly higher than the fused VANDC/VXOR or VXOR/VXOR sequences, so I have written the patch to prefer the Altivec instructions when they don't need a temporary register.
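
To make the code shape concrete, here is a minimal sketch (the function and variable names are made up and are not taken from the attached benchmark; build with something like -mcpu=power10 -O2).  It shows an andc feeding an xor, i.e. the combination the fuse_vandc_vxor pattern targets.  With only the Altivec fusion this wants to become adjacent VANDC/VXOR instructions in the 32 Altivec registers, while the new alternative allows a single XXEVAL that can use any of the 64 VSX registers:

    #include <altivec.h>

    /* Illustrative only: d = (a & ~b) ^ c on vector unsigned long long,
       the kind of andc-feeding-xor combination that shows up throughout
       the SHA3 code in the benchmark.  */
    vector unsigned long long
    andc_then_xor (vector unsigned long long a,
                   vector unsigned long long b,
                   vector unsigned long long c)
    {
      /* Candidate for VANDC/VXOR fusion, or for a single XXEVAL with
         the appropriate truth-table immediate.  */
      return vec_xor (vec_andc (a, b), c);
    }

(With the patch, the Altivec pair is still preferred when it does not need a temporary register, as described above.)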
Here are the results of adding support for XXEVAL for the multibuff.c benchmark attached to the PR.  Note that with this patch we essentially recover the speed that was lost with GCC 14 and the current trunk:

                                      XXEVAL   Trunk   GCC15   GCC14   GCC13   GCC12
                                      ------   -----   -----   -----   -----   -----
Multibuf time in seconds               5.600   6.151   6.129   6.053   5.539   5.598
XXEVAL improvement percentage            ---   +9.8%   +9.4%   +8.1%   -1.1%      0%
Fuse VANDC -> VXOR                       209     600     600     600     600     600
Fuse VXOR -> VXOR                          0     241     241     240     120     120
XXEVAL to fuse ANDC -> XOR (#45)         391       0       0       0       0       0
XXEVAL to fuse XOR -> XOR (#105)         240       0       0       0       0       0
Spill vector to stack                    140     417     417     403     226     239
Load spilled vector from stack           490   1,012   1,012   1,000     766     782
Vector moves                               8      93     100      70      72      72
XXLANDC or VANDC                         209     600     600     600     600     600
XXLXOR or VXOR                           953   1,824   1,824   1,824   1,824   1,825
XXEVAL                                   631       0       0       0       0       0

Here are the results of adding support for XXEVAL for the singlebuff.c benchmark attached to the PR.  Note that adding XXEVAL greatly speeds up this particular benchmark:

                                      XXEVAL   Trunk   GCC15   GCC14   GCC13   GCC12
                                      ------   -----   -----   -----   -----   -----
Singlebuf time in seconds              4.429   5.330   5.333   5.315   5.270   5.278
XXEVAL improvement percentage            ---  +20.3%  +20.4%  +20.0%  +19.0%  +19.2%
Fuse VANDC -> VXOR                       210     600     600     600     600     600
Fuse VXOR -> VXOR                          0     240     240     240     120     120
XXEVAL to fuse ANDC -> XOR (#45)         390       0       0       0       0       0
XXEVAL to fuse XOR -> XOR (#105)         240       0       0       0       0       0
Spill vector to stack                    134     388     388     388     391     391
Load spilled vector from stack           357     808     808     808     769     769
Vector moves                              34      80      80      80     119     119
XXLANDC or VANDC                         210     600     600     600     600     600
XXLXOR or VXOR                           954   1,824   1,824   1,824   1,824   1,824
XXEVAL                                   630       0       0       0       0       0

These patches add the following fusion patterns:

    xxland  => xxland     xxlandc => xxland     xxlxor  => xxland     xxlor   => xxland
    xxlnor  => xxland     xxleqv  => xxland     xxlorc  => xxland     xxlandc => xxlandc
    xxlnand => xxland     xxlnand => xxlnor     xxland  => xxlxor     xxland  => xxlor
    xxlandc => xxlxor     xxlandc => xxlor      xxlorc  => xxlnor     xxlorc  => xxleqv
    xxlorc  => xxlorc     xxleqv  => xxlnor     xxlxor  => xxlxor     xxlxor  => xxlor
    xxlnor  => xxlnor     xxlor   => xxlxor     xxlor   => xxlor      xxlor   => xxlnor
    xxlnor  => xxlxor     xxlnor  => xxlor      xxlxor  => xxlnor     xxleqv  => xxlxor
    xxleqv  => xxlor      xxlorc  => xxlxor     xxlorc  => xxlor      xxlandc => xxlnor
    xxlandc => xxleqv     xxland  => xxlnor     xxlnand => xxlxor     xxlnand => xxlor
    xxlnand => xxlnand    xxlorc  => xxlnand    xxleqv  => xxlnand    xxlnor  => xxlnand
    xxlor   => xxlnand    xxlxor  => xxlnand    xxlandc => xxlnand    xxland  => xxlnand

--
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meiss...@linux.ibm.com