On Mon, May 31, 2021 at 8:33 PM H.J. Lu via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Mon, May 31, 2021 at 11:13 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> >
> > On Mon, May 31, 2021 at 11:07 AM Jeff Law <jeffreya...@gmail.com> wrote:
> > >
> > >
> > >
> > > On 5/31/2021 6:04 AM, H.J. Lu wrote:
> > > > On Sun, May 30, 2021 at 11:49 AM Jeff Law <jeffreya...@gmail.com> wrote:
> > > >>
> > > >>
> > > >> On 5/11/2021 5:35 PM, H.J. Lu via Gcc-patches wrote:
> > > >>> Add TARGET_READ_MEMSET_VALUE and TARGET_GEN_MEMSET_VALUE to support
> > > >>> target instructions that duplicate a QImode value into a TImode/OImode/XImode
> > > >>> value for memset.  Define SCRATCH_SSE_REG as a scratch register for
> > > >>> ix86_gen_memset_value.
> > > >>>
> > > >>> gcc/
> > > >>>
> > > >>>        PR middle-end/90773
> > > >>>        * builtins.c (builtin_memset_read_str): Call
> > > >>>        targetm.read_memset_value.
> > > >>>        (builtin_memset_gen_str): Call targetm.gen_memset_value.
> > > >>>        * target.def (read_memset_value): New hook.
> > > >>>        (gen_memset_value): Likewise.
> > > >>>        * targhooks.c: Include "builtins.h".
> > > >>>        (default_read_memset_value): New function.
> > > >>>        (default_gen_memset_value): Likewise.
> > > >>>        * targhooks.h (default_read_memset_value): New prototype.
> > > >>>        (default_gen_memset_value): Likewise.
> > > >>>        * config/i386/i386-expand.c (ix86_expand_vector_init_duplicate):
> > > >>>        Make it global.
> > > >>>        * config/i386/i386-protos.h (ix86_minimum_incoming_stack_boundary):
> > > >>>        New.
> > > >>>        (ix86_expand_vector_init_duplicate): Likewise.
> > > >>>        * config/i386/i386.c (ix86_minimum_incoming_stack_boundary): Add
> > > >>>        an argument to ignore stack_alignment_estimated.  It is passed
> > > >>>        as false by default.
> > > >>>        (ix86_gen_memset_value_from_prev): New function.
> > > >>>        (ix86_gen_memset_value): Likewise.
> > > >>>        (ix86_read_memset_value): Likewise.
> > > >>>        (TARGET_GEN_MEMSET_VALUE): New.
> > > >>>        (TARGET_READ_MEMSET_VALUE): Likewise.
> > > >>>        * config/i386/i386.h (SCRATCH_SSE_REG): New.
> > > >>>        * doc/tm.texi.in: Add TARGET_READ_MEMSET_VALUE and
> > > >>>        TARGET_GEN_MEMSET_VALUE hooks.
> > > >>>        * doc/tm.texi: Regenerated.
> > > >>>
> > > >>> gcc/testsuite/
> > > >>>
> > > >>>        PR middle-end/90773
> > > >>>        * gcc.target/i386/pr90773-15.c: New test.
> > > >>>        * gcc.target/i386/pr90773-16.c: Likewise.
> > > >>>        * gcc.target/i386/pr90773-17.c: Likewise.
> > > >>>        * gcc.target/i386/pr90773-18.c: Likewise.
> > > >>>        * gcc.target/i386/pr90773-19.c: Likewise.
> > > >> Why does this need target hooks?  ISTM the right way to go here is to
> > > >> just emit the constant load to the target register and let the target
> > > >> figure out how best to construct the constant into the register.  If
> > > >> that means load it via QImode and broadcast, that's fine, but I'm not
> > > >> sure why that's not all implemented in the target files.
> > > >>
> > > > I will submit a patch to add optabs instead.
> > > I may be missing something, but I'm not even sure why we need special
> > > optabs.
> > >
> > > Aren't you just trying to efficiently get a constant element broadcast
> > > across an entire vector?
> >
> > Since vec_duplicate must not fail, and since for a constant QImode value
> > a vec_duplicate broadcast may be no faster than loading a compile-time
> > constant, I am adding vec_const_duplicate.  If vec_duplicate were allowed
> > to fail, I wouldn't need vec_const_duplicate.
> >
> > --
> > H.J.
>
>
> For
>
> extern void *ops;
>
> void
> foo (int c)
> {
>   __builtin_memset (ops, 4, 32);
> }
>
> without vec_const_duplicate, I got
>
> movl $4, %eax
> movq ops(%rip), %rdx
> movd %eax, %xmm0
> punpcklbw %xmm0, %xmm0
> punpcklwd %xmm0, %xmm0
> pshufd $0, %xmm0, %xmm0
> movups %xmm0, (%rdx)
> movups %xmm0, 16(%rdx)
> ret
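The four-instruction broadcast in the listing above can be modelled in portable C to show how each step doubles the width of the replicated byte. This is a hedged scalar sketch of the documented lane semantics of `movd`, `punpcklbw`, `punpcklwd` and `pshufd $0` when source and destination are the same register; the type `xmm_t` and the helpers `xmm_movd`, `xmm_punpcklbw_self`, `xmm_punpcklwd_self`, `xmm_pshufd0` and `broadcast_byte` are hypothetical names for illustration, not GCC code.

```c
#include <stdint.h>
#include <string.h>

/* A 128-bit XMM register modelled as 16 bytes; b[0] is the lowest byte.  */
typedef struct { uint8_t b[16]; } xmm_t;

/* movd %eax, %xmm0: copy 32 bits into the low dword, zero the rest.  */
static xmm_t xmm_movd (uint32_t eax)
{
  xmm_t r;
  memset (r.b, 0, sizeof r.b);
  for (int i = 0; i < 4; i++)
    r.b[i] = (uint8_t) (eax >> (8 * i));
  return r;
}

/* punpcklbw %xmm0, %xmm0: interleave the low 8 bytes with themselves,
   so input byte i lands in output bytes 2i and 2i+1.  */
static xmm_t xmm_punpcklbw_self (xmm_t x)
{
  xmm_t r;
  for (int i = 0; i < 8; i++)
    r.b[2 * i] = r.b[2 * i + 1] = x.b[i];
  return r;
}

/* punpcklwd %xmm0, %xmm0: interleave the low four 16-bit words with
   themselves, so input word i lands in output words 2i and 2i+1.  */
static xmm_t xmm_punpcklwd_self (xmm_t x)
{
  xmm_t r;
  for (int i = 0; i < 4; i++)
    for (int k = 0; k < 2; k++)
      {
        r.b[4 * i + k] = x.b[2 * i + k];
        r.b[4 * i + 2 + k] = x.b[2 * i + k];
      }
  return r;
}

/* pshufd $0, %xmm0, %xmm0: copy dword 0 into all four dwords.  */
static xmm_t xmm_pshufd0 (xmm_t x)
{
  xmm_t r;
  for (int d = 0; d < 4; d++)
    memcpy (&r.b[4 * d], &x.b[0], 4);
  return r;
}

/* Runtime broadcast of a byte, mirroring the listing (shown for c == 4).  */
static xmm_t broadcast_byte (uint8_t c)
{
  xmm_t x = xmm_movd (c);      /* 04 00 00 00 00 ...      */
  x = xmm_punpcklbw_self (x);  /* 04 04 00 00 00 ...      */
  x = xmm_punpcklwd_self (x);  /* 04 04 04 04 00 ...      */
  return xmm_pshufd0 (x);      /* 04 in all 16 bytes      */
}
```

Each unpack step doubles the replicated width (8 to 16 to 32 bits) and `pshufd` finishes the job, which is why the runtime path costs a GPR move plus three shuffles even though the result is known at compile time.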
>
> with vec_const_duplicate, I got
>
> movq ops(%rip), %rax
> movdqa .LC0(%rip), %xmm0
> movups %xmm0, (%rax)
> movups %xmm0, 16(%rax)
> ret

But you can construct the duplicated constant at compile time?
I thought the issue was that a constant-pool load is _not_ the
most efficient variant?
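To make the point about compile-time construction concrete: when the memset fill byte is a literal, the replicated 16-byte value is itself a constant, which is what `.LC0` holds in the second listing. A minimal portable C sketch, assuming the usual multiply-by-0x0101010101010101 byte-replication trick; `DUP_BYTE64`, `lc0` and `foo_const` are hypothetical names for illustration, not GCC code.

```c
#include <stdint.h>
#include <string.h>

/* Replicate an 8-bit value into all 8 bytes of a 64-bit word.  With a
   constant argument this is an integer constant expression, so it is
   folded at compile time and may be used in a static initializer.  */
#define DUP_BYTE64(c) ((uint64_t) (uint8_t) (c) * UINT64_C (0x0101010101010101))

/* Analogue of .LC0: 16 bytes of 0x04, built entirely at compile time.  */
static const uint64_t lc0[2] = { DUP_BYTE64 (4), DUP_BYTE64 (4) };

/* Same effect as __builtin_memset (dst, 4, 32): two 16-byte copies of
   the precomputed constant, like the movups pair in the listing.  */
static void foo_const (void *dst)
{
  memcpy ((char *) dst, lc0, sizeof lc0);
  memcpy ((char *) dst + 16, lc0, sizeof lc0);
}
```

Because every byte of the constant is identical, its in-memory layout does not depend on endianness; the open question in the thread is only whether materializing it from the constant pool beats synthesizing it in registers.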

>
> --
> H.J.
