On Wed, Jun 9, 2021 at 1:17 AM Hongtao Liu <crazy...@gmail.com> wrote:
>
> On Wed, Jun 9, 2021 at 2:02 AM H.J. Lu via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > 1. Update move expanders to convert the CONST_WIDE_INT and CONST_VECTOR
> > operands to vector broadcast from an integer with AVX2.
> > 2. Add ix86_gen_scratch_sse_rtx to return a scratch SSE register which
> > won't increase stack alignment requirement and blocks transformation by
> > the combine pass.
> > 3. Update PR 87767 tests to expect integer broadcast instead of broadcast
> > from memory.
> > 4. Update avx512f_cond_move.c to expect integer broadcast.
> >
> > A small benchmark:
> >
> > https://gitlab.com/x86-benchmarks/microbenchmark/-/tree/memset/broadcast
> >
> > shows that broadcast is a little bit faster on Intel Core i7-8559U:
> >
> > $ make
> > gcc -g -I. -O2   -c -o test.o test.c
> > gcc -g   -c -o memory.o memory.S
> > gcc -g   -c -o broadcast.o broadcast.S
> > gcc -g   -c -o vec_dup_sse2.o vec_dup_sse2.S
> > gcc -o test test.o memory.o broadcast.o vec_dup_sse2.o
> > ./test
> > memory      : 147215
> > broadcast   : 121213
> > vec_dup_sse2: 171366
> > $
> >
> > broadcast is also smaller:
> >
> > $ size memory.o broadcast.o
> >    text    data     bss     dec     hex filename
> >     132       0       0     132      84 memory.o
> >     122       0       0     122      7a broadcast.o
> > $
> Only the mov scenario was measured. With AVX-512 embedded broadcast,
> it is 1 instruction with an embedded broadcast vs. at least 3
> instructions: mov + broadcast + op. I'm not sure which is better.
>
> Take PR 87767 for example:
> vpaddd .LC1(%rip){1to16}, %zmm0, %zmm0
> .LC1:
>         .long   3
>
> vs
>
> movl $3, %eax
> vpbroadcastd %eax, %zmm1
> vpaddd %zmm1, %zmm0, %zmm0
>
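
To see which of the two sequences a given compiler picks, a minimal C
sketch along these lines could be used. It assumes GCC's vector
extensions; the type name v16si and the function add3 are illustrative
and are not taken from the PR 87767 tests.

```c
/* 16 x 32-bit int: the width of one zmm register.
   Uses the GCC/Clang vector_size extension (an assumption of this sketch).  */
typedef int v16si __attribute__ ((vector_size (64)));

/* Hypothetical helper: add the constant 3 to every lane.
   The scalar 3 is implicitly broadcast to all 16 lanes, so the compiler
   has to materialize it either from memory or from an integer register.  */
void
add3 (v16si *dst, const v16si *src)
{
  *dst = *src + 3;
}
```

Compiling this with something like "gcc -O2 -march=skylake-avx512 -S"
and reading the generated assembly should show whether the constant is
an embedded broadcast from memory ({1to16}) or a mov + vpbroadcastd
pair. The result depends on the compiler version and options, so it
should be inspected rather than assumed.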

https://gitlab.com/x86-benchmarks/microbenchmark/-/commits/vpaddd/broadcast

shows that vpbroadcastd is faster:

[hjl@gnu-skx-1 microbenchmark]$ make
gcc -g -I. -O2 -march=skylake-avx512   -c -o test.o test.c
gcc -g   -c -o memory.o memory.S
gcc -g   -c -o broadcast.o broadcast.S
gcc -o test test.o memory.o broadcast.o
./test
memory      : 425538
broadcast   : 375260
[hjl@gnu-skx-1 microbenchmark]$


-- 
H.J.
