https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007
Bug ID: 117007
Summary: Poor optimization for small vector constants needed for vector
         shift/rotate/mask generation.
Product: gcc
Version: 13.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: munroesj at gcc dot gnu.org
Target Milestone: ---

Created attachment 59291
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59291&action=edit
  compile with -m64 -O3 -mcpu=power8 or power9

For vector library code there is a frequent need to "splat" small integer
constants for vector shifts, rotates, and mask generation. The instructions
exist (i.e. vspltisw, xxspltib, xxspltiw) and are supported by intrinsics.
But when these are used to provide constants in VRs for other vector
operations, the compiler goes out of its way to convert them into vector
loads from .rodata. This is especially bad for power8/9, as .rodata access
requires 32-bit offsets and always generates 3/4 instructions with a
best-case (L1 cache hit) latency of 9 cycles. The original splat immediate /
shift implementation runs in 2-4 instructions (with a good chance for CSE)
and 4-6 cycles latency.

For example:

vui32_t
mask_sig_v2 ()
{
  vui32_t ones = vec_splat_u32(-1);
  vui32_t shft = vec_splat_u32(9);
  return vec_vsrw (ones, shft);
}

With GCC V6 this generates:

00000000000001c0 <mask_sig_v2>:
 1c0:	8c 03 09 10 	vspltisw v0,9
 1c4:	8c 03 5f 10 	vspltisw v2,-1
 1c8:	84 02 42 10 	vsrw    v2,v2,v0
 1cc:	20 00 80 4e 	blr

While GCC 13.2.1 generates:

00000000000001c0 <mask_sig_v2>:
 1c0:	00 00 4c 3c 	addis   r2,r12,0
			1c0: R_PPC64_REL16_HA	.TOC.
 1c4:	00 00 42 38 	addi    r2,r2,0
			1c4: R_PPC64_REL16_LO	.TOC.+0x4
 1c8:	00 00 22 3d 	addis   r9,r2,0
			1c8: R_PPC64_TOC16_HA	.rodata.cst16+0x20
 1cc:	00 00 29 39 	addi    r9,r9,0
			1cc: R_PPC64_TOC16_LO	.rodata.cst16+0x20
 1d0:	ce 48 40 7c 	lvx     v2,0,r9
 1d4:	20 00 80 4e 	blr

This is the same for -mcpu=power8/power9. It gets worse for vector
functions that require multiple shift/mask constants.
For example:

// Extract the float sig
vui32_t
test_extsig_v2 (vf32_t vrb)
{
  const vui32_t zero = vec_splat_u32(0);
  const vui32_t sigmask = mask_sig_v2 ();
  const vui32_t expmask = mask_exp_v2 ();
#if 1
  vui32_t ones = vec_splat_u32(-1);
  const vui32_t hidden = vec_sub (sigmask, ones);
#else
  const vui32_t hidden = mask_hidden_v2 ();
#endif
  vui32_t exp, sig, normal;

  exp = vec_and ((vui32_t) vrb, expmask);
  normal = vec_nor ((vui32_t) vec_cmpeq (exp, expmask),
                    (vui32_t) vec_cmpeq (exp, zero));
  sig = vec_and ((vui32_t) vrb, sigmask);
  // If normal, merge the hidden-bit into the sig-bits
  return (vui32_t) vec_sel (sig, normal, hidden);
}

GCC V6 generated:

0000000000000310 <test_extsig_v2>:
 310:	8c 03 bf 11 	vspltisw v13,-1
 314:	8c 03 37 10 	vspltisw v1,-9
 318:	8c 03 60 11 	vspltisw v11,0
 31c:	06 0a 0d 10 	vcmpgtub v0,v13,v1
 320:	84 09 00 10 	vslw    v0,v0,v1
 324:	8c 03 29 10 	vspltisw v1,9
 328:	17 14 80 f1 	xxland  vs44,vs32,vs34
 32c:	84 0a 2d 10 	vsrw    v1,v13,v1
 330:	86 00 0c 10 	vcmpequw v0,v12,v0
 334:	86 58 8c 11 	vcmpequw v12,v12,v11
 338:	80 6c a1 11 	vsubuwm v13,v1,v13
 33c:	17 14 41 f0 	xxland  vs34,vs33,vs34
 340:	17 65 00 f0 	xxlnor  vs32,vs32,vs44
 344:	7f 03 42 f0 	xxsel   vs34,vs34,vs32,vs45
 348:	20 00 80 4e 	blr

While GCC 13.2.1 -mcpu=power8 generates:

000000000000360 <test_extsig_v2>:
 360:	00 00 4c 3c 	addis   r2,r12,0
			360: R_PPC64_REL16_HA	.TOC.
 364:	00 00 42 38 	addi    r2,r2,0
			364: R_PPC64_REL16_LO	.TOC.+0x4
 368:	00 00 02 3d 	addis   r8,r2,0
			368: R_PPC64_TOC16_HA	.rodata.cst16+0x30
 36c:	00 00 42 3d 	addis   r10,r2,0
			36c: R_PPC64_TOC16_HA	.rodata.cst16+0x20
 370:	8c 03 a0 11 	vspltisw v13,0
 374:	00 00 08 39 	addi    r8,r8,0
			374: R_PPC64_TOC16_LO	.rodata.cst16+0x30
 378:	00 00 4a 39 	addi    r10,r10,0
			378: R_PPC64_TOC16_LO	.rodata.cst16+0x20
 37c:	00 00 22 3d 	addis   r9,r2,0
			37c: R_PPC64_TOC16_HA	.rodata.cst16+0x40
 380:	e4 06 4a 79 	rldicr  r10,r10,0,59
 384:	ce 40 20 7c 	lvx     v1,0,r8
 388:	00 00 29 39 	addi    r9,r9,0
			388: R_PPC64_TOC16_LO	.rodata.cst16+0x40
 38c:	8c 03 17 10 	vspltisw v0,-9
 390:	98 56 00 7c 	lxvd2x  vs0,0,r10
 394:	e4 06 29 79 	rldicr  r9,r9,0,59
 398:	98 4e 80 7d 	lxvd2x  vs12,0,r9
 39c:	84 01 21 10 	vslw    v1,v1,v0
 3a0:	50 02 00 f0 	xxswapd vs0,vs0
 3a4:	17 14 01 f0 	xxland  vs32,vs33,vs34
 3a8:	50 62 8c f1 	xxswapd vs12,vs12
 3ac:	12 14 00 f0 	xxland  vs0,vs0,vs34
 3b0:	86 00 21 10 	vcmpequw v1,v1,v0
 3b4:	86 68 00 10 	vcmpequw v0,v0,v13
 3b8:	17 05 21 f0 	xxlnor  vs33,vs33,vs32
 3bc:	33 0b 40 f0 	xxsel   vs34,vs0,vs33,vs12
 3c0:	20 00 80 4e 	blr

And GCC 13.2.1 -mcpu=power9 generates:

0000000000000310 <test_extsig_v2>:
 310:	00 00 4c 3c 	addis   r2,r12,0
			310: R_PPC64_REL16_HA	.TOC.
 314:	00 00 42 38 	addi    r2,r2,0
			314: R_PPC64_REL16_LO	.TOC.+0x4
 318:	00 00 22 3d 	addis   r9,r2,0
			318: R_PPC64_TOC16_HA	.rodata.cst16+0x10
 31c:	8c 03 17 10 	vspltisw v0,-9
 320:	00 00 42 3d 	addis   r10,r2,0
			320: R_PPC64_TOC16_HA	.rodata.cst16
 324:	d1 02 a0 f1 	xxspltib vs45,0
 328:	00 00 29 39 	addi    r9,r9,0
			328: R_PPC64_TOC16_LO	.rodata.cst16+0x10
 32c:	00 00 4a 39 	addi    r10,r10,0
			32c: R_PPC64_TOC16_LO	.rodata.cst16
 330:	09 00 29 f4 	lxv     vs33,0(r9)
 334:	01 00 0a f4 	lxv     vs0,0(r10)
 338:	00 00 22 3d 	addis   r9,r2,0
			338: R_PPC64_TOC16_HA	.rodata.cst16+0x20
 33c:	00 00 29 39 	addi    r9,r9,0
			33c: R_PPC64_TOC16_LO	.rodata.cst16+0x20
 340:	01 00 89 f5 	lxv     vs12,0(r9)
 344:	84 01 21 10 	vslw    v1,v1,v0
 348:	12 14 00 f0 	xxland  vs0,vs0,vs34
 34c:	17 14 01 f0 	xxland  vs32,vs33,vs34
 350:	86 00 21 10 	vcmpequw v1,v1,v0
 354:	86 68 00 10 	vcmpequw v0,v0,v13
 358:	17 05 21 f0 	xxlnor  vs33,vs33,vs32
 35c:	33 0b 40 f0 	xxsel   vs34,vs0,vs33,vs12
 360:	20 00 80 4e 	blr

I have attached a reduced test case for vector unsigned int with more
examples. None of these examples should convert the splat immediate
intrinsics into vector loads from .rodata.