https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007
Bug ID: 117007
Summary: Poor optimization for small vector constants needed for vector
         shift/rotate/mask generation.
Product: gcc
Version: 13.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: munroesj at gcc dot gnu.org
Target Milestone: ---

Created attachment 59291
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59291&action=edit
  compile with -m64 -O3 -mcpu=power8 or power9

For vector library code there is a frequent need to "splat" small integer
constants for vector shifts, rotates, and mask generation. The instructions
exist (i.e. vspltisw, xxspltib, xxspltiw) and are supported by intrinsics.
But when these are used to provide constants in VRs for other vector
operations, the compiler goes out of its way to convert them into vector
loads from .rodata. This is especially bad for power8/9, as .rodata access
requires 32-bit offsets and always generates 3/4 instructions with a
best-case (L1 cache hit) latency of 9 cycles. The original splat immediate /
shift implementation runs in 2-4 instructions (with a good chance for CSE)
and 4-6 cycles latency.

For example:

vui32_t
mask_sig_v2 ()
{
  vui32_t ones = vec_splat_u32(-1);
  vui32_t shft = vec_splat_u32(9);
  return vec_vsrw (ones, shft);
}

With GCC V6 this generates:

00000000000001c0 <mask_sig_v2>:
 1c0:	8c 03 09 10 	vspltisw v0,9
 1c4:	8c 03 5f 10 	vspltisw v2,-1
 1c8:	84 02 42 10 	vsrw    v2,v2,v0
 1cc:	20 00 80 4e 	blr

While GCC 13.2.1 generates:

00000000000001c0 <mask_sig_v2>:
 1c0:	00 00 4c 3c 	addis   r2,r12,0
			1c0: R_PPC64_REL16_HA	.TOC.
 1c4:	00 00 42 38 	addi    r2,r2,0
			1c4: R_PPC64_REL16_LO	.TOC.+0x4
 1c8:	00 00 22 3d 	addis   r9,r2,0
			1c8: R_PPC64_TOC16_HA	.rodata.cst16+0x20
 1cc:	00 00 29 39 	addi    r9,r9,0
			1cc: R_PPC64_TOC16_LO	.rodata.cst16+0x20
 1d0:	ce 48 40 7c 	lvx     v2,0,r9
 1d4:	20 00 80 4e 	blr

This is the same for -mcpu=power8/power9. It gets worse for vector
functions that require multiple shift/mask constants.
For example:

// Extract the float sig
vui32_t
test_extsig_v2 (vf32_t vrb)
{
  const vui32_t zero = vec_splat_u32(0);
  const vui32_t sigmask = mask_sig_v2 ();
  const vui32_t expmask = mask_exp_v2 ();
#if 1
  vui32_t ones = vec_splat_u32(-1);
  const vui32_t hidden = vec_sub (sigmask, ones);
#else
  const vui32_t hidden = mask_hidden_v2 ();
#endif
  vui32_t exp, sig, normal;

  exp = vec_and ((vui32_t) vrb, expmask);
  normal = vec_nor ((vui32_t) vec_cmpeq (exp, expmask),
                    (vui32_t) vec_cmpeq (exp, zero));
  sig = vec_and ((vui32_t) vrb, sigmask);
  // If normal, merge the hidden-bit into the sig-bits
  return (vui32_t) vec_sel (sig, normal, hidden);
}

GCC V6 generated:

0000000000000310 <test_extsig_v2>:
 310:	8c 03 bf 11 	vspltisw v13,-1
 314:	8c 03 37 10 	vspltisw v1,-9
 318:	8c 03 60 11 	vspltisw v11,0
 31c:	06 0a 0d 10 	vcmpgtub v0,v13,v1
 320:	84 09 00 10 	vslw    v0,v0,v1
 324:	8c 03 29 10 	vspltisw v1,9
 328:	17 14 80 f1 	xxland  vs44,vs32,vs34
 32c:	84 0a 2d 10 	vsrw    v1,v13,v1
 330:	86 00 0c 10 	vcmpequw v0,v12,v0
 334:	86 58 8c 11 	vcmpequw v12,v12,v11
 338:	80 6c a1 11 	vsubuwm v13,v1,v13
 33c:	17 14 41 f0 	xxland  vs34,vs33,vs34
 340:	17 65 00 f0 	xxlnor  vs32,vs32,vs44
 344:	7f 03 42 f0 	xxsel   vs34,vs34,vs32,vs45
 348:	20 00 80 4e 	blr

While GCC 13.2.1 -mcpu=power8 generates:

000000000000360 <test_extsig_v2>:
 360:	00 00 4c 3c 	addis   r2,r12,0
			360: R_PPC64_REL16_HA	.TOC.
 364:	00 00 42 38 	addi    r2,r2,0
			364: R_PPC64_REL16_LO	.TOC.+0x4
 368:	00 00 02 3d 	addis   r8,r2,0
			368: R_PPC64_TOC16_HA	.rodata.cst16+0x30
 36c:	00 00 42 3d 	addis   r10,r2,0
			36c: R_PPC64_TOC16_HA	.rodata.cst16+0x20
 370:	8c 03 a0 11 	vspltisw v13,0
 374:	00 00 08 39 	addi    r8,r8,0
			374: R_PPC64_TOC16_LO	.rodata.cst16+0x30
 378:	00 00 4a 39 	addi    r10,r10,0
			378: R_PPC64_TOC16_LO	.rodata.cst16+0x20
 37c:	00 00 22 3d 	addis   r9,r2,0
			37c: R_PPC64_TOC16_HA	.rodata.cst16+0x40
 380:	e4 06 4a 79 	rldicr  r10,r10,0,59
 384:	ce 40 20 7c 	lvx     v1,0,r8
 388:	00 00 29 39 	addi    r9,r9,0
			388: R_PPC64_TOC16_LO	.rodata.cst16+0x40
 38c:	8c 03 17 10 	vspltisw v0,-9
 390:	98 56 00 7c 	lxvd2x  vs0,0,r10
 394:	e4 06 29 79 	rldicr  r9,r9,0,59
 398:	98 4e 80 7d 	lxvd2x  vs12,0,r9
 39c:	84 01 21 10 	vslw    v1,v1,v0
 3a0:	50 02 00 f0 	xxswapd vs0,vs0
 3a4:	17 14 01 f0 	xxland  vs32,vs33,vs34
 3a8:	50 62 8c f1 	xxswapd vs12,vs12
 3ac:	12 14 00 f0 	xxland  vs0,vs0,vs34
 3b0:	86 00 21 10 	vcmpequw v1,v1,v0
 3b4:	86 68 00 10 	vcmpequw v0,v0,v13
 3b8:	17 05 21 f0 	xxlnor  vs33,vs33,vs32
 3bc:	33 0b 40 f0 	xxsel   vs34,vs0,vs33,vs12
 3c0:	20 00 80 4e 	blr

And GCC 13.2.1 -mcpu=power9 generates:

0000000000000310 <test_extsig_v2>:
 310:	00 00 4c 3c 	addis   r2,r12,0
			310: R_PPC64_REL16_HA	.TOC.
 314:	00 00 42 38 	addi    r2,r2,0
			314: R_PPC64_REL16_LO	.TOC.+0x4
 318:	00 00 22 3d 	addis   r9,r2,0
			318: R_PPC64_TOC16_HA	.rodata.cst16+0x10
 31c:	8c 03 17 10 	vspltisw v0,-9
 320:	00 00 42 3d 	addis   r10,r2,0
			320: R_PPC64_TOC16_HA	.rodata.cst16
 324:	d1 02 a0 f1 	xxspltib vs45,0
 328:	00 00 29 39 	addi    r9,r9,0
			328: R_PPC64_TOC16_LO	.rodata.cst16+0x10
 32c:	00 00 4a 39 	addi    r10,r10,0
			32c: R_PPC64_TOC16_LO	.rodata.cst16
 330:	09 00 29 f4 	lxv     vs33,0(r9)
 334:	01 00 0a f4 	lxv     vs0,0(r10)
 338:	00 00 22 3d 	addis   r9,r2,0
			338: R_PPC64_TOC16_HA	.rodata.cst16+0x20
 33c:	00 00 29 39 	addi    r9,r9,0
			33c: R_PPC64_TOC16_LO	.rodata.cst16+0x20
 340:	01 00 89 f5 	lxv     vs12,0(r9)
 344:	84 01 21 10 	vslw    v1,v1,v0
 348:	12 14 00 f0 	xxland  vs0,vs0,vs34
 34c:	17 14 01 f0 	xxland  vs32,vs33,vs34
 350:	86 00 21 10 	vcmpequw v1,v1,v0
 354:	86 68 00 10 	vcmpequw v0,v0,v13
 358:	17 05 21 f0 	xxlnor  vs33,vs33,vs32
 35c:	33 0b 40 f0 	xxsel   vs34,vs0,vs33,vs12
 360:	20 00 80 4e 	blr

I have attached a reduced test case for vector unsigned int with more
examples. None of these examples should convert the splat immediate
intrinsics into vector loads from .rodata.