On Thu, Jun 13, 2024 at 4:20 AM Roger Sayle <ro...@nextmovesoftware.com> wrote: > > > This patch makes more use of m32bcst and m64bcst addressing modes in > ix86_expand_ternlog. Previously, the i386 backend would only consider > using a m32bcst if the inner mode of the vector was 32-bits, or using > m64bcst if the inner mode was 64-bits. For ternlog (and other logic > operations) this is a strange restriction, as how the same constant > is materialized is dependent upon the mode it is used/operated on. > Hence, the V16QI constant {2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2} wouldn't > use m??bcst, but (V4SI){0x02020202,0x02020202,0x02020202,0x02020202} > which has the same bit pattern would. This can optimized by (re)checking > whether a CONST_VECTOR can be broadcast from memory after casting it > to VxSI (or for m64bst to VxDI) where x has the appropriate vector size. > > > Taking the test case from pr115407: > > __attribute__((__vector_size__(64))) char v; > void foo() { > v = v | v << 7; > } > > Compiled with -O2 -mcmodel=large -mavx512bw > GCC 14 generates a 64-byte (512-bit) load from the constant pool: > > foo: movabsq $v, %rax // 10 > movabsq $.LC0, %rdx // 10 > vpsllw $7, (%rax), %zmm1 // 7 > vmovdqa64 (%rax), %zmm0 // 6 > vpternlogd $248, (%rdx), %zmm1, %zmm0 // 7 > vmovdqa64 %zmm0, (%rax) // 6 > vzeroupper // 3 > ret // 1 > .LC0: .byte -12 // 64 = 114 bytes > .byte -128 > ;; repeated another 62 times > > mainline currently generates two instructions, using interunit broadcast: > > foo: movabsq $v, %rdx // 10 > movl $-2139062144, %eax // 5 > vmovdqa64 (%rdx), %zmm2 // 6 > vpbroadcastd %eax, %zmm0 // 6 > vpsllw $7, %zmm2, %zmm1 // 7 > vpternlogd $236, %zmm0, %zmm2, %zmm1 // 7 > vmovdqa64 %zmm1, (%rdx) // 6 > vzeroupper // 3 > ret // 1 = 51 bytes > > With this patch, we now generate a broadcast addressing mode: > > foo: movabsq $v, %rax // 10 > movabsq $.LC1, %rdx // 10 > vmovdqa64 (%rax), %zmm1 // 6 > vpsllw $7, %zmm1, %zmm0 // 7 > vpternlogd $236, (%rdx){1to16}, %zmm1, %zmm0 // 7 > vmovdqa64 %zmm0, (%rax) // 6 > vzeroupper // 3 > ret // 1 = 50 total > > Without -mcmodel=large, the benefit is two instructions: > > foo: vmovdqa64 v(%rip), %zmm1 // 10 > vpsllw $7, %zmm1, %zmm0 // 7 > vpternlogd $236, .LC2(%rip){1to16}, %zmm1, %zmm0 // 11 > vmovdqa64 %zmm0, v(%rip) // 10 > vzeroupper // 3 > ret // 1 = 42 > total > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > and make -k check, both with and without --target_board=unix{-m32} > with no new failures. Ok for mainline? Ok. > > > 2024-06-12 Roger Sayle <ro...@nextmovesoftware.com> > > gcc/ChangeLog > * config/i386/i386-expand.cc (ix86_expand_ternlog): Try performing > logic operation in a different vector mode if that enables use of > a 32-bit or 64-bit broadcast addressing mode. > > gcc/testsuite/ChangeLog > * gcc.target/i386/pr115407.c: New test case. > > > Thanks in advance, > Roger > -- >
-- BR, Hongtao