https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007
--- Comment #10 from Steven Munroe <munroesj at gcc dot gnu.org> ---
(In reply to Segher Boessenkool from comment #7)
> It is always more and slower code.

Yes. More examples:

vui64_t
test_sld_52_v1 (vui64_t vra)
{
  vui32_t shft = vec_splat_u32(52-64);
  return vec_vsld (vra, (vui64_t) shft);
}

vui64_t
test_sld_52_v0 (vui64_t vra)
{
  return vra << 52;
}

The PowerISA is challenged to generate a vector doubleword constant, so it seems easier to load such constants from .rodata. Again, a load from .rodata is a minimum of 3 instructions with a latency of 9 cycles (L1 cache hit). But there are many examples of vector doubleword operations that need small constants. Also, the doubleword shift/rotate operations only require a 6-bit shift count. Here, changing the vector shift intrinsics to accept vector unsigned char for the shift count would be helpful.

It is often faster to generate these constants from existing splat-immediate instructions and 1-2 other operations than to pay the full latency cost of a (.rodata) vector load. For power8 the current GCC compilers take this option away from the library developer. For example:

gcc-13 -O3 -mcpu=power8 -mtune=power8

00000000000001e0 <test_sld_52_v0>:                    #TL 11/11
 1e0: 00 00 4c 3c  addis r2,r12,.TOC.@ha
 1e4: 00 00 42 38  addi  r2,r2,.TOC.@l
 1e8: 00 00 22 3d  addis r9,r2,.rodata.cst16@ha      #L 2/2
 1ec: 00 00 29 39  addi  r9,r9,.rodata.cst16@l       #L 2/2
 1f0: ce 48 00 7c  lvx   v0,0,r9                     #L 5/5
 1f4: c4 05 42 10  vsld  v2,v2,v0                    #L 2/2
 1f8: 20 00 80 4e  blr

00000000000001b0 <test_sld_52_v1>:                    #TL 11/11
 1e0: 00 00 4c 3c  addis r2,r12,.TOC.@ha
 1e4: 00 00 42 38  addi  r2,r2,.TOC.@l
 1e8: 00 00 22 3d  addis r9,r2,.rodata.cst16@ha      #L 2/2
 1ec: 00 00 29 39  addi  r9,r9,.rodata.cst16@l       #L 2/2
 1c0: ce 48 00 7c  lvx   v0,0,r9                     #L 5/5
 1c4: c4 05 42 10  vsld  v2,v2,v0                    #L 2/2
 1c8: 20 00 80 4e  blr

The original Power64LE support compilers, in contrast, allowed the library developer to use intrinsics to generate smaller/faster sequences.
Again, the PowerISA vector shift/rotate doubleword operations only need the low-order 6 bits of the shift count. Here the original altivec vec_splat_u32() can easily generate shift counts in the ranges 0-15 and 48-63. Or, if the vector shift/rotate intrinsics accepted vector unsigned char for the shift count, the library developer could use vec_splat_u8().

gcc-6 -O3 -mcpu=power8 -mtune=power8

0000000000000170 <test_sld_52_v1>:                    #TL 4/4
 170: 8c 03 14 10  vspltisw v0,-12                   #L 2/2
 174: c4 05 42 10  vsld     v2,v2,v0                 #L 2/2
 178: 20 00 80 4e  blr

Power9 has the advantage of VSX Vector Splat Immediate Byte and will use it inline for the vector shift. But it always inserts the extend-signed-byte-to-doubleword (vextsb2d). The current Power Intrinsic Reference does not provide a direct mechanism to generate xxspltib. If vec_splat_u32() is used, the current compiler (constant propagation?) will convert this into a load vector (lxv this time) from .rodata. This is still 3 instructions and 9 cycles.

gcc-13 -O3 -mcpu=power9 -mtune=power9

00000000000001a0 <test_sld_52_v0>:                    #TL 7/7
 1a0: d1 a2 01 f0  xxspltib vs32,52                  #L 3/3
 1a4: 02 06 18 10  vextsb2d v0,v0                    #L 2/2
 1a8: c4 05 42 10  vsld     v2,v2,v0                 #L 2/2
 1ac: 20 00 80 4e  blr

0000000000000170 <test_sld_52_v1>:                    #TL 11/11
 1e0: 00 00 4c 3c  addis r2,r12,.TOC.@ha
 1e4: 00 00 42 38  addi  r2,r2,.TOC.@l
 1e8: 00 00 22 3d  addis r9,r2,.rodata.cst16@ha      #L 2/2
 1ec: 00 00 29 39  addi  r9,r9,.rodata.cst16@l       #L 2/2
 180: 09 00 09 f4  lxv   vs32,0(r9)                  #L 5/5
 184: c4 05 42 10  vsld  v2,v2,v0                    #L 2/2
 188: 20 00 80 4e  blr

This is still larger and slower than if the compiler/intrinsic allowed the direct use of xxspltib to generate the shift count for vsld.

gcc-fix -O3 -mcpu=power9 -mtune=power9

0000000000000170 <test_sld_52_v1>:                    #TL 5/5
 170: d1 a2 01 f0  xxspltib vs32,52                  #L 3/3
 174: c4 05 42 10  vsld     v2,v2,v0                 #L 2/2
 178: 20 00 80 4e  blr

Power10 also generates VSX Vector Splat Immediate Byte plus the sign extension inline for the vector doubleword shift.
But it again converts the vec_splat_u32() intrinsic into a load vector (plxv this time) from .rodata. This is smaller and faster than the power9 sequence but seems a bit of overkill for the small constant (52) involved.

gcc-13 -O3 -mcpu=power10 -mtune=power10

00000000000001d0 <test_sld_52_v0>:                    #TL 7/11
 1d0: d1 a2 01 f0  xxspltib vs32,52                  #L 3/4
 1d4: 02 06 18 10  vextsb2d v0,v0                    #L 3/4
 1d8: c4 05 42 10  vsld     v2,v2,v0                 #L 1/3
 1dc: 20 00 80 4e  blr

00000000000001b0 <test_sld_52_v1>:                    #TL 5/9
 1b0: 00 00 10 04  plxv vs32,.rodata.cst16           #L 4/6
 1b8: c4 05 42 10  vsld v2,v2,v0                     #L 1/3
 1bc: 20 00 80 4e  blr

Both are larger and slower than if the compiler/intrinsic allowed the direct use of xxspltib to generate the shift count for vsld.

gcc-fix -O3 -mcpu=power10 -mtune=power10

0000000000000170 <test_sld_52_v1>:                    #TL 4/7
 170: d1 a2 01 f0  xxspltib vs32,52                  #L 3/4
 174: c4 05 42 10  vsld     v2,v2,v0                 #L 1/3
 178: 20 00 80 4e  blr