https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007
--- Comment #9 from Steven Munroe <munroesj at gcc dot gnu.org> ---
(In reply to Segher Boessenkool from comment #7)
> It is always more and slower code.

Yes. Let's try some specific examples and examine the code generated for
power8/9/10.

vui32_t
test_slw_23_v0 (vui32_t vra)
{
  return vra << 23;
}

vui32_t
test_slw_23_v1 (__vector unsigned int vra)
{
  vui32_t shft = vec_splat_u32(23-32);
  return vec_sl (vra, shft);
}

gcc-13 -O3 -mcpu=power8 -mtune=power8:

0000000000000100 <test_slw_23_v0>:                           #TL 11/11
 100:   00 00 4c 3c     addis   r2,r12,.TOC.@ha
 104:   00 00 42 38     addi    r2,r2,.TOC.@l
 108:   00 00 22 3d     addis   r9,r2,.rodata.cst16@ha       #L 2/2
 10c:   00 00 29 39     addi    r9,r9,.rodata.cst16@l        #L 2/2
 110:   ce 48 00 7c     lvx     v0,0,r9                      #L 5/5
 114:   84 01 42 10     vslw    v2,v2,v0                     #L 2/2
 118:   20 00 80 4e     blr

00000000000000e0 <test_slw_23_v1>:                           #TL 4/4
  e0:   8c 03 17 10     vspltisw v0,-9                       #L 2/2
  e4:   84 01 42 10     vslw    v2,v2,v0                     #L 2/2
  e8:   20 00 80 4e     blr

For the inline vector shift, GCC tends to generate a load from .rodata. The
addis/addi/lvx (3 instruction) sequence is always generated for the medium
memory model. Only the linker will know the final offset, so there is no
optimization. This is a dependent sequence with a best-case (L1 cache hit)
latency of 11 cycles.

Using the vector unsigned int type and the vec_splat_u32()/vec_sl() intrinsic
sequence generates two instructions (vspltisw/vslw) for this simple case.
Again a dependent sequence, for 4 cycles total. 4 cycles beats 11.

gcc-13 -O3 -mcpu=power9 -mtune=power9:

0000000000000100 <test_slw_23_v0>:                           #TL 7/7
 100:   d1 ba 00 f0     xxspltib vs32,23                     #L 3/3
 104:   02 06 10 10     vextsb2w v0,v0                       #L 2/2
 108:   84 01 42 10     vslw    v2,v2,v0                     #L 2/2
 10c:   20 00 80 4e     blr

00000000000000e0 <test_slw_23_v1>:                           #TL 5/5
  e0:   8c 03 17 10     vspltisw v0,-9                       #L 3/3
  e4:   84 01 42 10     vslw    v2,v2,v0                     #L 2/2
  e8:   20 00 80 4e     blr

Power9 has the advantage of VSX Vector Splat Immediate Byte and will use it
for the vector inline. The disadvantage is that it is a byte splat for a word
shift, so the compiler inserts the (pedantic) Vector Extend Sign Byte To Word
(vextsb2w). This adds 1 instruction and 2 cycles latency to the sequence. The
ISA for Vector Shift Left Word only requires the low-order 5 bits of each
element for the shift count, so the extend is not required and either
vspltisw or xxspltib will work here. This is an example where changing the
vector shift intrinsics to accept vector unsigned char for the shift count
would be helpful (a sketch of that idea follows below).

Again the intrinsic implementation beats the compiler's vector inline code by
2 cycles (5 vs 7) and one less instruction.

gcc-13 -O3 -mcpu=power10 -mtune=power10:

0000000000000100 <test_slw_23_v0>:                           #TL 4/7
 100:   00 00 00 05     xxspltiw vs32,23                     #L 3/4
 104:   17 00 07 80
 108:   84 01 42 10     vslw    v2,v2,v0                     #L 1/3
 10c:   20 00 80 4e     blr

00000000000000e0 <test_slw_23_v1>:                           #TL 4/7
  e0:   8c 03 17 10     vspltisw v0,-9                       #L 3/4
  e4:   84 01 42 10     vslw    v2,v2,v0                     #L 1/3
  e8:   20 00 80 4e     blr

Power10 has the advantage of the VSX Vector Splat Immediate Word instruction.
This is an 8-byte prefixed instruction and is overkill for a 5-bit shift
count. The good news is the cycle latency is the same, but it adds another
word to the code stream that is not required to generate such a small (5-bit)
constant. However, VSX Vector Splat Immediate Word will be excellent for
generating mb/me/sh masks for Vector Rotate Left Word then Mask Insert and
the like. So I will concede that, for the shift/rotate word immediate case on
power10, the latencies are comparable.
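In the meantime, here is a sketch of the workaround the power9 case suggests
(test_slw_23_v2 is my name for it, not from the testcase above): splat the
shift count as bytes and reinterpret as words, since vslw only reads the
low-order 5 bits of each word element. The byte-splat pattern 0x17171717
supplies the count 23 in every word with no sign extend. Whether GCC keeps
the single xxspltib here is not something I have verified for every version:

#include <altivec.h>
typedef __vector unsigned int vui32_t;

vui32_t
test_slw_23_v2 (vui32_t vra)
{
  /* Splat 23 into each byte; on power9 this can be a single xxspltib.  */
  __vector unsigned char shft = vec_splats ((unsigned char) 23);
  /* The cast is a no-op reinterpretation.  vslw uses only the low-order
     5 bits of each word element, and those are 23 (0x17).  */
  return vec_sl (vra, (vui32_t) shft);
}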
The problem I see is: as the examples get more complex (generating masks for
float, or doubleword/quadword shifts), the compiler (via CSE or constant
propagation) will convert the splat-immediate sequence into a vector load
from .rodata.
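For illustration, a minimal sketch of the kind of splat-immediate idiom at
risk (the function name is mine; it builds the binary32 exponent mask
0x7F800000 from splat immediates, using the same vui32_t typedef as above):
all-ones from vec_splat_u32(-1), shift right 24, then shift left 23. Whether
this survives as vspltisw/vsrw/vslw or gets constant-folded back into a
.rodata load depends on the compiler version and options:

static inline vui32_t
vec_mask32_f32exp (void)
{
  /* vspltisw -1: 0xFFFFFFFF in each word element.  */
  vui32_t ones = vec_splat_u32 (-1);
  /* Shift counts use only the low-order 5 bits: -8 -> 24, -9 -> 23.  */
  vui32_t shr = vec_splat_u32 (-8);
  vui32_t shl = vec_splat_u32 (-9);
  /* (0xFFFFFFFF >> 24) << 23 == 0x7F800000 in each word.  */
  return vec_sl (vec_sr (ones, shr), shl);
}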