https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

--- Comment #9 from Steven Munroe <munroesj at gcc dot gnu.org> ---
(In reply to Segher Boessenkool from comment #7)
> It is always more and slower code.  Yes.

Let's try some specific examples and examine the code generated for power8/9/10.

#include <altivec.h>

typedef __vector unsigned int vui32_t;

vui32_t
test_slw_23_v0 (vui32_t vra)
{
  return vra << 23;
}

vui32_t
test_slw_23_v1 (__vector unsigned int vra)
{
  /* vec_splat_u32() only accepts 5-bit signed literals (-16..15), so
     23 is encoded as (23-32) = -9.  vslw uses only the low-order
     5 bits of each element, and -9 modulo 32 = 23.  */
  vui32_t shft = vec_splat_u32(23-32);
  return vec_sl (vra, shft);
}

gcc-13 -O3 -mcpu=power8 -mtune=power8
0000000000000100 <test_slw_23_v0>:                      #TL 11/11
 100:   00 00 4c 3c     addis   r2,r12,.TOC.@ha
 104:   00 00 42 38     addi    r2,r2,.TOC.@l
 108:   00 00 22 3d     addis   r9,r2,.rodata.cst16@ha  #L 2/2
 10c:   00 00 29 39     addi    r9,r9,.rodata.cst16@l   #L 2/2
 110:   ce 48 00 7c     lvx     v0,0,r9                 #L 5/5
 114:   84 01 42 10     vslw    v2,v2,v0                #L 2/2
 118:   20 00 80 4e     blr

00000000000000e0 <test_slw_23_v1>:                      #TL 4/4
  e0:   8c 03 17 10     vspltisw v0,-9                  #L 2/2
  e4:   84 01 42 10     vslw    v2,v2,v0                #L 2/2
  e8:   20 00 80 4e     blr

For the inline vector operator, GCC tends to generate a load from .rodata. The
addis/addi/lvx (3-instruction) sequence is always generated for the medium
memory model. Only the linker will know the final offset, so there is no
opportunity to optimize this away. It is a dependent sequence with a best-case
(L1 cache hit) latency of 11 cycles.

Using the vector unsigned int type and the vec_splat_u32()/vec_sl() intrinsic
sequence generates two instructions (vspltisw/vslw) for this simple case.
Again a dependent sequence, for 4 cycles total latency. 4 cycles beats 11.

gcc-13 -O3 -mcpu=power9 -mtune=power9
0000000000000100 <test_slw_23_v0>:                      #TL 7/7
 100:   d1 ba 00 f0     xxspltib vs32,23                #L 3/3
 104:   02 06 10 10     vextsb2w v0,v0                  #L 2/2
 108:   84 01 42 10     vslw    v2,v2,v0                #L 2/2
 10c:   20 00 80 4e     blr

 00000000000000e0 <test_slw_23_v1>:                     #TL 5/5
  e0:   8c 03 17 10     vspltisw v0,-9                  #L 3/3
  e4:   84 01 42 10     vslw    v2,v2,v0                #L 2/2
  e8:   20 00 80 4e     blr

Power9 has the advantage of VSX Vector Splat Immediate Byte and will use it
for the inline vector operator. The disadvantage is that it is a byte splat
for a word shift. So the compiler inserts the (pedantic) Vector Extend Sign
Byte To Word (vextsb2w). This adds 1 instruction and 2 cycles latency to the
sequence.

The ISA for vector shift word only requires the low-order 5 bits of each
element for the shift count. So the extend is not required, and either
vspltisw or xxspltib will work here. This is an example where changing the
vector shift intrinsics to accept vector unsigned char for the shift count
would be helpful.
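
As a minimal sketch of that idea for power9 (test_slw_23_v2 and the vui8_t
typedef are illustrative names, not part of the original test cases; it
assumes only the low-5-bits behavior described above):

typedef __vector unsigned char vui8_t;

vui32_t
test_slw_23_v2 (vui32_t vra)
{
  /* Splat the byte 23 to all 16 bytes; on power9 this should be a
     single xxspltib.  Each word element is then 0x17171717, whose
     low-order 5 bits are 23, which is all vslw reads for the count.  */
  vui8_t shft = vec_splats ((unsigned char) 23);
  return vec_sl (vra, (vui32_t) shft);
}

The cast sidesteps the vextsb2w because the reinterpreted word elements
already carry the correct shift count in their low-order 5 bits.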

Again the intrinsic implementation beats the compiler's inline vector code by
2 cycles (5 vs 7 cycles) with one less instruction.

gcc-13 -O3 -mcpu=power10 -mtune=power10
0000000000000100 <test_slw_23_v0>:                      #TL 4/7
 100:   00 00 00 05     xxspltiw vs32,23                #L 3/4
 104:   17 00 07 80 
 108:   84 01 42 10     vslw    v2,v2,v0                #L 1/3
 10c:   20 00 80 4e     blr

00000000000000e0 <test_slw_23_v1>:                      #TL 4/7
  e0:   8c 03 17 10     vspltisw v0,-9                  #L 3/4
  e4:   84 01 42 10     vslw    v2,v2,v0                #L 1/3
  e8:   20 00 80 4e     blr

Power10 has the advantage of the VSX Vector Splat Immediate Word instruction.
This is an 8-byte prefixed instruction and is overkill for a 5-bit shift count.

The good news is the cycle latency is the same, but it adds another word to
the code stream which is not required to generate such a small (5-bit)
constant.

However, VSX Vector Splat Immediate Word will be excellent for generating
mb/me/sh masks for Vector Rotate Left Word then Mask Insert and the like.
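
As a minimal sketch (test_mask_sig_f32 is an illustrative name; it assumes
GCC at -mcpu=power10 materializes a word-splat constant like this with a
single xxspltiw):

vui32_t
test_mask_sig_f32 (void)
{
  /* Float significand mask.  No 5-bit splat immediate can produce
     this value; power8 would need a .rodata load or a splat/shift
     sequence, while power10 can use one xxspltiw.  */
  return vec_splats ((unsigned int) 0x007fffff);
}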

So I will concede that for the shift/rotate word immediate case on power10
the latencies are comparable.

The problem I see is: as the examples get more complex (generating masks for
float, or doubleword/quadword shifts) the compiler (via CSE or constant
propagation) will convert the splat immediate into a vector load from .rodata.
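
As a minimal sketch of the kind of sequence at risk (test_slq_23 and the
vui128_t typedef are illustrative, with vui8_t as in the earlier sketch; the
vec_slo/vec_sll composition assumes the usual shift-by-octets-then-bits
technique for quadwords):

typedef __vector unsigned __int128 vui128_t;

vui128_t
test_slq_23 (vui128_t vra)
{
  /* Splat the 7-bit shift count to all 16 bytes.  vslo reads the
     octet count from bits 121:124 and vsl reads the bit count from
     bits 125:127, so together they shift the quadword left 23 bits.  */
  vui8_t shft = vec_splats ((unsigned char) 23);
  vui8_t tmp  = vec_slo ((vui8_t) vra, shft);
  return (vui128_t) vec_sll (tmp, shft);
}

Here a single splat immediate feeds two dependent shifts; the concern is that
constant propagation replaces that splat with a 16-byte .rodata constant and
the load-latency cost returns.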
