luoxhu <luo...@linux.ibm.com> writes:
> Hi,
>
> On 2020/7/21 23:30, Richard Sandiford wrote:
>> Xiong Hu Luo <luo...@linux.ibm.com> writes:
>>> @@ -1872,9 +1872,27 @@ get_stored_val (store_info *store_info, machine_mode read_mode,
>>>      {
>>>        poly_int64 shift = gap * BITS_PER_UNIT;
>>>        poly_int64 access_size = GET_MODE_SIZE (read_mode) + gap;
>>> -      read_reg = find_shift_sequence (access_size, store_info, read_mode,
>>> -                                      shift, optimize_bb_for_speed_p (bb),
>>> -                                      require_cst);
>>> +      rtx rhs_subreg = NULL;
>>> +
>>> +      if (known_eq (GET_MODE_BITSIZE (store_mode), shift * 2))
>>> +        {
>>> +          scalar_int_mode inner_mode = smallest_int_mode_for_size (shift);
>>> +          poly_uint64 sub_off
>>> +            = ((!BYTES_BIG_ENDIAN)
>>> +               ? GET_MODE_SIZE (store_mode) - GET_MODE_SIZE (inner_mode)
>>> +               : 0);
>>> +
>>> +          rhs_subreg = simplify_gen_subreg (inner_mode, store_info->rhs,
>>> +                                            store_mode, sub_off);
>>> +          if (rhs_subreg)
>>> +            read_reg
>>> +              = extract_low_bits (read_mode, inner_mode, copy_rtx (rhs_subreg));
>>> +        }
>>> +
>>> +      if (read_reg == NULL)
>>> +        read_reg
>>> +          = find_shift_sequence (access_size, store_info, read_mode, shift,
>>> +                                 optimize_bb_for_speed_p (bb), require_cst);
>>
>> Did you consider doing this in find_shift_sequence instead?
>> ISTM that this is really using subregs to optimise:
>>
>>   /* In theory we could also check for an ashr.  Ian Taylor knows
>>      of one dsp where the cost of these two was not the same.  But
>>      this really is a rare case anyway.  */
>>   target = expand_binop (new_mode, lshr_optab, new_reg,
>>                          gen_int_shift_amount (new_mode, shift),
>>                          new_reg, 1, OPTAB_DIRECT);
>>
>> I think everything up to:
>>
>>   /* Also try a wider mode if the necessary punning is either not
>>      desirable or not possible.  */
>>   if (!CONSTANT_P (store_info->rhs)
>>       && !targetm.modes_tieable_p (new_mode, store_mode))
>>     continue;
>>
>> is either neutral or helpful for the subreg case too, so maybe
>> we could just add the optimisation after that.  (It probably isn't
>> worth reusing any of the later loop body code though, since the
>> subreg case is much simpler.)
>>
>> I don't think we need to restrict this case to modes of size
>> shift * 2.  We can just check whether the shift is a multiple of
>> the new_mode calculated by find_shift_sequence (using multiple_p).
>>
>> An easier way of converting the shift to a subreg byte offset
>> is to use subreg_offset_from_lsb, which also handles
>> BYTES_BIG_ENDIAN != WORDS_BIG_ENDIAN.
>
> Thanks, I've updated the patch by moving it into find_shift_sequence.
> Not sure whether it meets your comments precisely, though it still
> works. :)  There is a comment there that says:
>
>   /* Some machines like the x86 have shift insns for each size of
>      operand.  Other machines like the ppc or the ia-64 may only have
>      shift insns that shift values within 32 or 64 bit registers.
>      This loop tries to find the smallest shift insn that will right
>      justify the value we want to read but is available in one insn on
>      the machine.  */
>
> so without some additional check the loop breaks out early, because
> new_mode is TImode here:
>
>   if (GET_MODE_BITSIZE (new_mode) > BITS_PER_WORD)
>     break;
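(To see that early break concretely: a plain-C sketch of the loop's
width walk, not GCC's machmode code, where the constants are
assumptions for the PR71309 case on a 64-bit target.

  #include <stdio.h>

  #define BITS_PER_WORD 64   /* assumption: 64-bit target */

  int
  main (void)
  {
    /* access_size for PR71309: an 8-byte read at an 8-byte gap,
       so the smallest covering width is already 128 bits (TImode).  */
    int access_bits = 128;

    /* Walk candidate widths upwards from the smallest that covers
       the access, the way the real loop walks the integer modes.  */
    for (int new_bits = access_bits; new_bits <= 128; new_bits *= 2)
      {
        if (new_bits > BITS_PER_WORD)
          {
            /* Fires on the very first candidate, so neither the shift
               nor any subreg code is ever reached.  */
            printf ("give up at %d bits: wider than a word\n", new_bits);
            break;
          }
        printf ("try a %d-bit shift\n", new_bits);
      }
    return 0;
  }

The loop never gets past the first candidate, which is why the v2
patch below adds an extra condition to the break.)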
>
> [PATCH v2] dse: Remove partial load after full store for high part
> access [PR71309]
>
> This patch could optimize (works for char/short/int/void*):
>
>   6: r119:TI=[r118:DI+0x10]
>   7: [r118:DI]=r119:TI
>   8: r121:DI=[r118:DI+0x8]
>
> =>
>
>   6: r119:TI=[r118:DI+0x10]
>   18: r122:TI=r119:TI
>   16: r123:TI#0=r122:TI#8 0>>0
>   17: r123:TI#8=0
>   19: r124:DI=r123:TI#0
>   7: [r118:DI]=r119:TI
>   8: r121:DI=r124:DI
>
> The final asm then has no partial load after the full store
> (stxv+ld):
>
>   mr 9,3
>   ld 3,24(3)
>   ld 10,16(3)
>   std 3,8(9)
>   std 10,0(9)
>   blr
>
> It could achieve ~25% performance improvement for typical cases on
> Power9.  Bootstrapped and regression tested on Power9-LE.
>
> For AArch64, one ldr is replaced by a mov:
>
>   ldp x2, x3, [x0, 16]
>   stp x2, x3, [x0]
>   ldr x0, [x0, 8]
>
> =>
>
>   mov x1, x0
>   ldp x2, x0, [x0, 16]
>   stp x2, x0, [x1]
>
> gcc/ChangeLog:
>
> 2020-07-22  Xionghu Luo  <luo...@linux.ibm.com>
>
>         PR rtl-optimization/71309
>         * dse.c (find_shift_sequence): Use a subreg of the high part of
>         the stored register to avoid loading from the address.
>
> gcc/testsuite/ChangeLog:
>
> 2020-07-22  Xionghu Luo  <luo...@linux.ibm.com>
>
>         PR rtl-optimization/71309
>         * gcc.target/powerpc/pr71309.c: New test.
> ---
>  gcc/dse.c                                  | 15 +++++++++-
>  gcc/testsuite/gcc.target/powerpc/pr71309.c | 33 ++++++++++++++++++++++
>  2 files changed, 47 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr71309.c
>
> diff --git a/gcc/dse.c b/gcc/dse.c
> index bbe792e48e8..e06a9fbb0cd 100644
> --- a/gcc/dse.c
> +++ b/gcc/dse.c
> @@ -1736,7 +1736,8 @@ find_shift_sequence (poly_int64 access_size,
>        int cost;
>
>        new_mode = new_mode_iter.require ();
> -      if (GET_MODE_BITSIZE (new_mode) > BITS_PER_WORD)
> +      if (GET_MODE_BITSIZE (new_mode) > BITS_PER_WORD
> +          && !multiple_p (GET_MODE_BITSIZE (new_mode), shift))
>          break;
>
>        /* If a constant was stored into memory, try to simplify it here,
> @@ -1793,6 +1794,18 @@ find_shift_sequence (poly_int64 access_size,
>        shift_seq = get_insns ();
>        end_sequence ();
>
> +      /* Use subreg shifting from the high part to avoid a full store
> +         followed by a partial load.  See PR71309.  */
> +      if (multiple_p (GET_MODE_BITSIZE (new_mode), shift) && shift_seq)
> +        {
> +          new_lhs = extract_low_bits (new_mode, store_mode,
> +                                      copy_rtx (store_info->rhs));
> +          emit_move_insn (new_reg, new_lhs);
> +          emit_insn (shift_seq);
> +          read_reg = extract_low_bits (read_mode, new_mode, target);
> +          break;
> +        }
> +
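(The quoted diff trims off here and omits the body of the new
pr71309.c testcase, so the following is a purely hypothetical C
reproducer of the pattern behind the RTL above; all type and function
names are invented and the real testcase may differ.

  /* Hypothetical reproducer -- the quoted diff omits pr71309.c, so the
     names and layout here are guesses.  */
  typedef struct
  {
    long lo;   /* bytes 0..7  */
    long hi;   /* bytes 8..15, the "high part" on little-endian  */
  } pair_t;

  typedef struct
  {
    pair_t a;  /* bytes 0..15, the fully stored destination  */
    pair_t b;  /* bytes 16..31, the TImode source  */
  } node_t;

  long
  read_high (node_t *p)
  {
    p->a = p->b;      /* full 16-byte store:   7: [r118:DI]=r119:TI  */
    return p->a.hi;   /* partial 8-byte load:  8: r121:DI=[r118:DI+0x8]  */
  }

The struct copy is the full TImode store and the field read is the
partial DImode load that DSE wants to forward from the stored value.)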
This wasn't really what I meant.  Using subregs is fine, but I was
thinking of:

  /* Also try a wider mode if the necessary punning is either not
     desirable or not possible.  */
  if (!CONSTANT_P (store_info->rhs)
      && !targetm.modes_tieable_p (new_mode, store_mode))
    continue;

  if (multiple_p (shift, GET_MODE_BITSIZE (new_mode)))
    {
      /* Try to implement the shift using a subreg.  */
      poly_int64 offset = subreg_offset_from_lsb (new_mode, store_mode,
                                                  shift);
      rhs_subreg = simplify_gen_subreg (new_mode, store_info->rhs,
                                        store_mode, offset);
      if (rhs_subreg)
        {
          ...
          break;
        }
    }

where the rhs_subreg is from your original patch.

The multiple_p should be that way round: the shift needs to be a
multiple of the width of new_mode for the subreg to be valid.

I think this should also avoid the BITS_PER_WORD problem.  On the
other hand, I agree BITS_PER_WORD isn't a very sensible limit if
we're using subregs, so maybe moving that break check to after the
multiple_p if block would still make sense.

Thanks,
Richard
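(For reference: subreg_offset_from_lsb converts a shift counted from
the value's least significant bit into a subreg byte offset.  A
standalone model of that calculation, not GCC's implementation, and
only for the simple case where byte and word endianness agree:

  #include <assert.h>

  static int
  offset_from_lsb (int store_bytes, int read_bytes, int shift_bits,
                   int big_endian)
  {
    int lsb_byte = shift_bits / 8;  /* distance from the lsb, in bytes */
    return big_endian
           ? store_bytes - read_bytes - lsb_byte  /* lsb stored last  */
           : lsb_byte;                            /* lsb stored first */
  }

  int
  main (void)
  {
    /* TImode store (16 bytes), DImode read (8 bytes) of the high half,
       i.e. shift == 64: byte offset 8 on little-endian, 0 on big-endian.  */
    assert (offset_from_lsb (16, 8, 64, 0) == 8);
    assert (offset_from_lsb (16, 8, 64, 1) == 0);
    return 0;
  }

This matches the two BYTES_BIG_ENDIAN arms of the original
get_stored_val patch quoted at the top of the thread.)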