On Tue, Jun 1, 2021 at 6:25 AM Richard Biener <richard.guent...@gmail.com> wrote: > > On Tue, Jun 1, 2021 at 3:05 PM H.J. Lu <hjl.to...@gmail.com> wrote: > > > > On Mon, May 31, 2021 at 11:54:53PM -0600, Jeff Law wrote: > > > > > > > > > On 5/31/2021 11:50 PM, Richard Sandiford wrote: > > > > "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes: > > > > > On Mon, May 31, 2021 at 06:32:04AM -0700, H.J. Lu wrote: > > > > > > On Mon, May 31, 2021 at 6:26 AM Richard Biener > > > > > > <richard.guent...@gmail.com> wrote: > > > > > > > On Mon, May 31, 2021 at 3:12 PM H.J. Lu <hjl.to...@gmail.com> > > > > > > > wrote: > > > > > > > > On Mon, May 31, 2021 at 5:46 AM Richard Biener > > > > > > > > <richard.guent...@gmail.com> wrote: > > > > > > > > > On Mon, May 31, 2021 at 2:09 PM H.J. Lu <hjl.to...@gmail.com> > > > > > > > > > wrote: > > > > > > > > > > On Wed, May 26, 2021 at 10:28:16AM +0200, Richard Biener > > > > > > > > > > wrote: > > > > > > > > > > > > > > -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx > > > > > > > > > > > > > > DATA, scalar_int_mode > > > > > > > > > > > > > > MODE) > > > > > > > > > > > > > > This function returns the RTL of a register > > > > > > > > > > > > > > containing > > > > > > > > > > > > > > 'GET_MODE_SIZE (MODE)' consecutive copies of > > > > > > > > > > > > > > the unsigned char > > > > > > > > > > > > > > value given in the RTL register DATA. For > > > > > > > > > > > > > > example, if MODE is 4 > > > > > > > > > > > > > > bytes wide, return the RTL for > > > > > > > > > > > > > > 0x01010101*DATA. > > > > > > > > > > > > > For this one I wonder if it should be an optab > > > > > > > > > > > > > instead. Couldn't you > > > > > > > > > > > > > use the existing vec_duplicate for this by using > > > > > > > > > > > > > (paradoxical) subregs > > > > > > > > > > > > > like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI > > > > > > > > > > > > > (reg:QI ...)))? > > > > > > > > > > > > I tried. It doesn't even work on x86. See: > > > > > > > > > > > > > > > > > > > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570661.html > > > > > > > > > > > Not sure what I should read from there... > > > > > > > > > > > > > > > > > > > > > > > There are special cases to subreg HI, SI and DI modes > > > > > > > > > > > > of TI mode in > > > > > > > > > > > > ix86_gen_memset_value_from_prev. simplify_gen_subreg > > > > > > > > > > > > doesn't > > > > > > > > > > > > work here. Each backend may need its own special > > > > > > > > > > > > handling. > > > > > > > > > > > OK, I guess I'm not (RTL) qualified enough to further > > > > > > > > > > > review these parts, > > > > > > > > > > > sorry. Since we're doing code generation the canonical > > > > > > > > > > > way to communicate > > > > > > > > > > > with backends should be optabs, not some set of > > > > > > > > > > > disconnected target hooks. > > > > > > > > > > > But as said, I probably don't know enough of RTL to see > > > > > > > > > > > why it's the only way. > > > > > > > > > > > > > > > > > > > > > > Richard. > > > > > > > > > > Here is the patch to add optabs instead. Does it look OK? > > > > > > > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > > > > H.J. > > > > > > > > > > --- > > > > > > > > > > Add 2 optabs: > > > > > > > > > > > > > > > > > > > > 1. integer_extract: Extract lower bit value from the > > > > > > > > > > integer value in > > > > > > > > > > TImode, OImode or XImode. > > > > > > > > > That sounds very specific, esp. the restriction to > > > > > > > > > {TI,OI,XI}mode. > > > > > > > > > It also sounds like it matches (subreg:{TI,OI,XI} (...) 0). > > > > > > > > > There are > > > > > > > > > existing target hooks verifying subreg validity - why's that > > > > > > > > > not a good > > > > > > > > > fit here? ISTR you say gen_lowpart () doesn't work (or was it > > > > > > > > > simplify_gen_subreg?), why's that so? > > > > > > > > {TI,OI,XI}mode are storage only integer types. subreg doesn't > > > > > > > > work > > > > > > > > well on them. I got > > > > > > > > > > > > > > > > [hjl@gnu-cfl-2 pieces]$ cat s2.i > > > > > > > > extern void *ops; > > > > > > > > > > > > > > > > void > > > > > > > > foo (int c) > > > > > > > > { > > > > > > > > __builtin_memset (ops, c, 34); > > > > > > > > } > > > > > > > > [hjl@gnu-cfl-2 pieces]$ make s2.s > > > > > > > > /export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/xgcc > > > > > > > > -B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/ > > > > > > > > -O2 -march=haswell -S s2.i > > > > > > > > during RTL pass: reload > > > > > > > > s2.i: In function ‘foo’: > > > > > > > > s2.i:7:1: internal compiler error: maximum number of generated > > > > > > > > reload > > > > > > > > insns per insn achieved (90) > > > > > > > > 7 | } > > > > > > > > | ^ > > > > > > > > 0x1050734 lra_constraints(bool) > > > > > > > > /export/gnu/import/git/gitlab/x86-gcc/gcc/lra-constraints.c:5091 > > > > > > > > 0x1039536 lra(_IO_FILE*) > > > > > > > > /export/gnu/import/git/gitlab/x86-gcc/gcc/lra.c:2336 > > > > > > > > 0xfe1140 do_reload > > > > > > > > /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:5822 > > > > > > > > 0xfe162e execute > > > > > > > > /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:6008 > > > > > > > > Please submit a full bug report, > > > > > > > > with preprocessed source if appropriate. > > > > > > > > Please include the complete backtrace with any bug report. > > > > > > > > See <https://gcc.gnu.org/bugs/> for instructions. > > > > > > > > make: *** [Makefile:32: s2.s] Error 1 > > > > > > > > [hjl@gnu-cfl-2 pieces]$ > > > > > > > > > > > > > > > > due to > > > > > > > > > > > > > > > > (insn 12 11 0 (set (mem:HI (plus:DI (reg/f:DI 84) > > > > > > > > (const_int 32 [0x20])) [0 MEM <char[1:34]> > > > > > > > > [(void > > > > > > > > *)ops.0_1]+32 S2 A8]) > > > > > > > > (subreg:HI (reg:OI 51 xmm15) 0)) "s2.i":6:3 -1 > > > > > > > > (nil)) > > > > > > > > > > > > > > > > The new optab gives us > > > > > > > > > > > > > > > > (insn 12 11 13 2 (set (reg:TI 88) > > > > > > > > (reg:TI 51 xmm15)) "s2.i":6:3 -1 > > > > > > > > (nil)) > > > > > > > > (insn 13 12 14 2 (set (reg:SI 89) > > > > > > > > (subreg:SI (reg:TI 88) 0)) "s2.i":6:3 -1 > > > > > > > > (nil)) > > > > > > > > (insn 14 13 15 2 (set (reg:HI 87) > > > > > > > > (subreg:HI (reg:SI 89) 0)) "s2.i":6:3 -1 > > > > > > > > (nil)) > > > > > > > that looks odd to me - what's the final result after LRA? I think > > > > > > I got: > > > > > > > > > > > > vmovd %edi, %xmm15 > > > > > > movq ops(%rip), %rdx > > > > > > vpbroadcastb %xmm15, %ymm15 > > > > > > vmovq %xmm15, %rax <<<< move to GPR > > > > > > vmovdqu %ymm15, (%rdx) > > > > > > movw %ax, 32(%rdx) <<<< subreg of GPR > > > > > > vzeroupper > > > > > > ret > > > > > > > > > > > > > we should see to make lowpart_subreg work on {XI,OI,TI}mode. > > > > > > > Only two steps should be necessary at most: > > > > > > > xmm -> gpr, grp -> subreg, or gpr -> subreg. So the expander > > > > > > > code in memset should try to generate the subreg directly > > > > > > subreg didn't fail on x86 when I tried. > > > > > > > > > > > > > and if that fails, try a word_mode subreg followed by the subreg. > > > > > > I will try word_mode subreg. > > > > > > > > > > > Here is the v2 patch to use word_mode subreg. For > > > > > > > > > > --- > > > > > extern void *ops; > > > > > > > > > > void > > > > > foo (int c) > > > > > { > > > > > __builtin_memset (ops, 4, 32); > > > > > } > > > > > --- > > > > > > > > > > without vec_const_duplicate, I got > > > > > > > > > > movl $4, %eax > > > > > movq ops(%rip), %rdx > > > > > movd %eax, %xmm0 > > > > > punpcklbw %xmm0, %xmm0 > > > > > punpcklwd %xmm0, %xmm0 > > > > > pshufd $0, %xmm0, %xmm0 > > > > > movups %xmm0, (%rdx) > > > > > movups %xmm0, 16(%rdx) > > > > > ret > > > > > > > > > > With vec_const_duplicate, I got > > > > > > > > > > movq ops(%rip), %rax > > > > > movdqa .LC0(%rip), %xmm0 > > > > > movups %xmm0, (%rax) > > > > > movups %xmm0, 16(%rax) > > > > > ret > > > > > > > > > > If vec_duplicate is allowed to fail, I don't need vec_const_duplicate. > > > > I don't understand why we need an optab for this though. If the operand > > > > is constant then we should just be doing an ordinary move in which the > > > > source is a CONST_VECTOR. It's then up to the move patterns to handle > > > > duplicated constants as efficiently as possible. (Sorry if this was > > > > discussed upthread and I missed it.) > > > That's exactly the point I'm trying to get across as well. > > > > > > > This is what we do today. But I'd like to generate > > > > movl $4, %eax > > vpbroadcastb %eax, %ymm15 > > movq ops(%rip), %rax > > vmovdqu %ymm15, (%rax) > > vzeroupper > > ret > > > > instead of > > > > vmovdqa .LC0(%rip), %ymm15 > > movq ops(%rip), %rax > > vmovdqu %ymm15, (%rax) > > vzeroupper > > ret > > > > Do I need a vec_dup pattern for it? > > I think we have special code sequences to materialize some > constant vectors already, we should be able to add to that, no?
We can do that for all 0s and all 1s at the final codegen. For other values, since we need a GPR, we can't do that. -- H.J.