https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007
--- Comment #10 from Steven Munroe <munroesj at gcc dot gnu.org> ---
(In reply to Segher Boessenkool from comment #7)
> It is always more and slower code.

Yes. More examples:

vui64_t
test_sld_52_v1 (vui64_t vra)
{
  vui32_t shft = vec_splat_u32(52-64);
  return vec_vsld (vra, (vui64_t) shft);
}

vui64_t
test_sld_52_v0 (vui64_t vra)
{
  return vra << 52;
}

The PowerISA is challenged to generate a vector doubleword constant, so it seems easier to load such constants from .rodata. Again, a load from .rodata is a minimum of 3 instructions with a latency of 9 cycles (L1 cache hit). But there are many examples of vector doubleword operations that need small constants. Also, the doubleword shift/rotate operations only require a 6-bit shift count. Here, changing the vector shift intrinsics to accept vector unsigned char for the shift count would be helpful.

It is often faster to generate these constants from existing splat-immediate instructions and 1-2 other operations than to pay the full latency cost of a (.rodata) vector load. For power8 the current GCC compilers take this option away from the library developer. For example:

gcc-13 -O3 -mcpu=power8 -mtune=power8

00000000000001e0 <test_sld_52_v0>:                    #TL 11/11
 1e0: 00 00 4c 3c  addis r2,r12,.TOC.@ha
 1e4: 00 00 42 38  addi  r2,r2,.TOC.@l
 1e8: 00 00 22 3d  addis r9,r2,.rodata.cst16@ha      #L 2/2
 1ec: 00 00 29 39  addi  r9,r9,.rodata.cst16@l       #L 2/2
 1f0: ce 48 00 7c  lvx   v0,0,r9                     #L 5/5
 1f4: c4 05 42 10  vsld  v2,v2,v0                    #L 2/2
 1f8: 20 00 80 4e  blr

00000000000001b0 <test_sld_52_v1>:                    #TL 11/11
 1e0: 00 00 4c 3c  addis r2,r12,.TOC.@ha
 1e4: 00 00 42 38  addi  r2,r2,.TOC.@l
 1e8: 00 00 22 3d  addis r9,r2,.rodata.cst16@ha      #L 2/2
 1ec: 00 00 29 39  addi  r9,r9,.rodata.cst16@l       #L 2/2
 1c0: ce 48 00 7c  lvx   v0,0,r9                     #L 5/5
 1c4: c4 05 42 10  vsld  v2,v2,v0                    #L 2/2
 1c8: 20 00 80 4e  blr

The original Power64LE support compilers, in contrast, allowed the library developer to use intrinsics to generate smaller/faster sequences.
Again, the PowerISA vector shift/rotate doubleword operations only need the low-order 6 bits of the shift count. Here the original altivec vec_splat_u32() can easily generate shift counts in the ranges 0-15 and 48-63. Or, if the vector shift/rotate intrinsics accepted vector unsigned char for the shift count, the library developer could use vec_splat_u8().

gcc-6 -O3 -mcpu=power8 -mtune=power8

0000000000000170 <test_sld_52_v1>:                    #TL 4/4
 170: 8c 03 14 10  vspltisw v0,-12                   #L 2/2
 174: c4 05 42 10  vsld     v2,v2,v0                 #L 2/2
 178: 20 00 80 4e  blr

Power9 has the advantage of VSX Vector Splat Immediate Byte and will use it inline for the vector shift. But it always inserts the extend-signed-byte-to-doubleword (vextsb2d). The current Power Intrinsic Reference does not provide a direct mechanism to generate xxspltib. If vec_splat_u32() is used, the current compiler (constant propagation?) will convert this into a load vector (lxv this time) from .rodata. This is still 3 instructions and 9 cycles.

gcc-13 -O3 -mcpu=power9 -mtune=power9

00000000000001a0 <test_sld_52_v0>:                    #TL 7/7
 1a0: d1 a2 01 f0  xxspltib vs32,52                  #L 3/3
 1a4: 02 06 18 10  vextsb2d v0,v0                    #L 2/2
 1a8: c4 05 42 10  vsld     v2,v2,v0                 #L 2/2
 1ac: 20 00 80 4e  blr

0000000000000170 <test_sld_52_v1>:                    #TL 11/11
 1e0: 00 00 4c 3c  addis r2,r12,.TOC.@ha
 1e4: 00 00 42 38  addi  r2,r2,.TOC.@l
 1e8: 00 00 22 3d  addis r9,r2,.rodata.cst16@ha      #L 2/2
 1ec: 00 00 29 39  addi  r9,r9,.rodata.cst16@l       #L 2/2
 180: 09 00 09 f4  lxv   vs32,0(r9)                  #L 5/5
 184: c4 05 42 10  vsld  v2,v2,v0                    #L 2/2
 188: 20 00 80 4e  blr

This is still larger and slower than if the compiler/intrinsic allowed the direct use of xxspltib to generate the shift count for vsld.

gcc-fix -O3 -mcpu=power9 -mtune=power9

0000000000000170 <test_sld_52_v1>:                    #TL 5/5
 170: d1 a2 01 f0  xxspltib vs32,52                  #L 3/3
 174: c4 05 42 10  vsld     v2,v2,v0                 #L 2/2
 178: 20 00 80 4e  blr

Power10 also generates VSX Vector Splat Immediate Byte plus the sign extension inline for the vector doubleword shift.
But it again converts the vec_splat_u32() intrinsic into a load vector (plxv this time) from .rodata. This is smaller and faster than the power9 sequence but seems a bit of overkill for the small constant (52) involved.

gcc-13 -O3 -mcpu=power10 -mtune=power10

00000000000001d0 <test_sld_52_v0>:                    #TL 7/11
 1d0: d1 a2 01 f0  xxspltib vs32,52                  #L 3/4
 1d4: 02 06 18 10  vextsb2d v0,v0                    #L 3/4
 1d8: c4 05 42 10  vsld     v2,v2,v0                 #L 1/3
 1dc: 20 00 80 4e  blr

00000000000001b0 <test_sld_52_v1>:                    #TL 5/9
 1b0: 00 00 10 04  plxv vs32,.rodata.cst16           #L 4/6
 1b8: c4 05 42 10  vsld v2,v2,v0                     #L 1/3
 1bc: 20 00 80 4e  blr

Both are larger and slower than if the compiler/intrinsic allowed the direct use of xxspltib to generate the shift count for vsld.

gcc-fix -O3 -mcpu=power10 -mtune=power10

0000000000000170 <test_sld_52_v1>:                    #TL 4/7
 170: d1 a2 01 f0  xxspltib vs32,52                  #L 3/4
 174: c4 05 42 10  vsld     v2,v2,v0                 #L 1/3
 178: 20 00 80 4e  blr