Hi, Add Hongtao Liu as the patch affects x86. 在 2024/8/20 6:32, Richard Sandiford 写道: > HAO CHEN GUI <guih...@linux.ibm.com> writes: >> Hi, >> This patch adds const0 move checking for CLEAR_BY_PIECES. The original >> vec_duplicate handles duplicates of non-constant inputs. But 0 is a >> constant. So even a platform doesn't support vec_duplicate, it could >> still do clear by pieces if it supports const0 move by that mode. >> >> Compared to the previous version, the main change is to set up a >> new function to generate const0 for certain modes and use the function >> as by_pieces_constfn for CLEAR_BY_PIECES. >> https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660344.html >> >> Bootstrapped and tested on powerpc64-linux BE and LE with no >> regressions. >> >> On i386, it got several regressions. One issue is the predicate of >> V16QI move expand doesn't include const0. Thus V16QI mode can't be used >> for clear by pieces with the patch. The second issue is the const0 is >> passed directly to the move expand with the patch. Originally it is >> forced to a pseudo and i386 can leverage the previous data to do >> optimization. > > The patch looks good to me, but I suppose we'll need to decide what > to do about x86. > > It's not obvious to me why movv16qi requires a nonimmediate_operand > source, especially since ix86_expand_vector_mode does have code to > cope with constant operand[1]s. emit_move_insn_1 doesn't check the > predicates anyway, so the predicate will have little effect. > > A workaround would be to check legitimate_constant_p instead of the > predicate, but I'm not sure that that should be necessary. > > Has this already been discussed? If not, we should loop in the x86 > maintainers (but I didn't do that here in case it would be a repeat).
I also noticed it. Not sure why movv16qi requires a nonimmediate_operand, while ix86_expand_vector_mode could deal with constant op. Looking forward to Hongtao's comments. > > As far as the second issue goes, I suppose there are at least three > ways of handling shared constants: > > (1) Force the zero into a register and leave later optimisations to > propagate the zero where profitable. The zero can be propagated into the store, but the address adjustment may not be combined into insn properly. For instance, if zero is forced to a register, "movv2x8qi" insn is generated. The address adjustment becomes a separate insn as "movv2x8qi" insn doesn't support d-from address. When zero is propagated, it converts "movv2x8qi" to "movti". "movti" supports d-from as well as post/inc address. Probably, the auto_inc_dec pass combines address adjustment insn into previous "movti" to generate a post inc "movti". The expected optimization might be to combine address adjustment insn into second "movit" and generate a d-form "movti". It's a regression issue I found in aarch64. Also we checks if const0 is supported for mov optab. But finally we force the const0 to a register and generate a store with the register. Seems it's not reasonable. > > (2) Emit stores of zero and expect a later pass to share constants > where beneficial. Not sure which pass can optimize it. > > (3) Generate stores of zero and leave the target expanders to force > constants into registers on the fly if reuse seems plausibly > beneficial. > The constant zero with different modes are not relevant. Not sure which pass can optimize it. The compiler should be taught that reg 102 can be expressed as a subreg of reg 100. (insn 6 5 7 2 (set (reg:V32QI 100) (const_vector:V32QI [ (const_int 0 [0]) repeated x32 ])) (insn 8 7 0 2 (set (reg:V16QI 102) (const_vector:V16QI [ (const_int 0 [0]) repeated x16 ])) I tested a case with one 32-byte and one 16-byte memory clear on x86. These two sets can't be optimized. They can be optimized only when they are in the same memory clear operation (for example, a 48-byte memory clear). Thanks Gui Haochen > where (3) is a middle ground between (1) and (2). > > Thanks, > Richard > >> The patch also raises several regressions on aarch64. The V2x8QImode >> replaces TImode to do 16-byte clear by pieces as V2x8QImode move expand >> supports const0 and vector mode is preferable. I drafted a patch to >> address the issue. It will be sent for review in a separate email. >> Another problem is V8QImode replaces DImode to do 8-byte clear by pieces. >> It seems cause different sequences of instructions but the actually >> instructions are the same. >> >> Thanks >> Gui Haochen >> >> ChangeLog >> expand: Add const0 move checking for CLEAR_BY_PIECES optabs >> >> vec_duplicate handles duplicates of non-constant inputs. The 0 is a >> constant. So even a platform doesn't support vec_duplicate, it could >> still do clear by pieces if it supports const0 move. This patch adds >> the checking. >> >> gcc/ >> * expr.cc (by_pieces_mode_supported_p): Add const0 move checking >> for CLEAR_BY_PIECES. >> (set_zero): New. >> (clear_by_pieces): Pass set_zero as by_pieces_constfn. >> >> patch.diff >> diff --git a/gcc/expr.cc b/gcc/expr.cc >> index ffbac513692..7199e0956f8 100644 >> --- a/gcc/expr.cc >> +++ b/gcc/expr.cc >> @@ -1014,14 +1014,20 @@ can_use_qi_vectors (by_pieces_operation op) >> static bool >> by_pieces_mode_supported_p (fixed_size_mode mode, by_pieces_operation op) >> { >> - if (optab_handler (mov_optab, mode) == CODE_FOR_nothing) >> + enum insn_code icode = optab_handler (mov_optab, mode); >> + if (icode == CODE_FOR_nothing) >> return false; >> >> - if ((op == SET_BY_PIECES || op == CLEAR_BY_PIECES) >> + if (op == SET_BY_PIECES >> && VECTOR_MODE_P (mode) >> && optab_handler (vec_duplicate_optab, mode) == CODE_FOR_nothing) >> return false; >> >> + if (op == CLEAR_BY_PIECES >> + && VECTOR_MODE_P (mode) >> + && !insn_operand_matches (icode, 1, CONST0_RTX (mode))) >> + return false; >> + >> if (op == COMPARE_BY_PIECES >> && !can_compare_p (EQ, mode, ccp_jump)) >> return false; >> @@ -1840,16 +1846,20 @@ store_by_pieces (rtx to, unsigned HOST_WIDE_INT len, >> return to; >> } >> >> +static rtx >> +set_zero (void *, void *, HOST_WIDE_INT, fixed_size_mode mode) >> +{ >> + return CONST0_RTX (mode); >> +} >> + >> void >> clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align) >> { >> if (len == 0) >> return; >> >> - /* Use builtin_memset_read_str to support vector mode broadcast. */ >> - char c = 0; >> - store_by_pieces_d data (to, builtin_memset_read_str, &c, len, align, >> - CLEAR_BY_PIECES); >> + /* Use set_zero to generate const0 of centain mode. */ >> + store_by_pieces_d data (to, set_zero, NULL, len, align, CLEAR_BY_PIECES); >> data.run (); >> }