Hi Richard, Thanks for the suggestions.
> -----Original Message----- > From: Richard Sandiford [mailto:richard.sandif...@arm.com] > Sent: Thursday, May 21, 2020 5:22 PM > To: Yangfei (Felix) <felix.y...@huawei.com> > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with > fixed sve vector length > > "Yangfei (Felix)" <felix.y...@huawei.com> writes: > > Hi, > > > > Notice a tiny SVE-related performance issue: > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95254 > > > > For the given test case, SLP succeeds with VNx8HImode with or without > option -msve-vector-bits=256. > > The root cause for the difference is that we choose a different mode in > aarch64_vectorize_related_mode under -msve-vector-bits=256: > VNx2HImode instead of V4HImode. > > Then in the final tree ssa forwprop pass, we need to do a VIEW_CONVERT > from V4HImode to VNx2HImode. > > > > PATCH catch and simplify the pattern in > aarch64_expand_sve_mem_move, emitting a mov pattern of V4HImode > instead. > > I am assuming endianness does not make a difference here considering > this simplification. > > Bootstrap and tested on aarch64-linux-gnu. Comments? > > I think we should try to catch this at the gimple level, possibly during SLP > itself. Agreed. It's better if this can be handled in SLP. For the given test case, the difference reflect itself after the final ssa forwprop. So I guess it might be hard for SLP to catch and evaluate in the vect cost model. > Although the patch handles the case in which the V4HI is stored directly to > memory, I assume it won't help if the code instead does: > > for (i = 0; i < 4; i++) > b[i] = u.a[i] + 1; SLP failed if you modify like that. > targetm.can_change_mode_class (..., ALL_REGS) would be a good indicator > of whether the needed VIEW_CONVERT_EXPR is cheap or expensive. > > I agree it might still be worth handling this in the move patterns too. > It feels like a target-independent optimisation though, and for example > should also apply to V4HI moves involving subregs of VNx2HIs. > > So I think it would be worth trying to do this in emit_move_insn. > In principle it would be useful for: > > // M1 and M2 equal size, !targetm.can_change_mode_class (M1, M2, > ALL_REGS) > > (set (subreg:M1 (reg:M2 ...)) (subreg:M1 (reg:M2 ...))) > (set (subreg:M1 (reg:M2 ...)) (mem:M1 ADDR)) > (set (mem:M1 ADDR) (subreg:M1 (reg:M2 ...))) > (set (subreg:M1 (reg:M2 ...)) (constant C)) > > It would be nice if we could do this even without the > can_change_mode_class condition, provided that it doesn't turn valid M1 > constants or MEMs into invalid M2 ones (or at least, M2 ones that need to be > validated). > Unfortunately, going that far is likely to interfere with target-specific > choices, > so it's probably too dangerous. > > With the can_change_mode_class condition it should be fine though, since > it's avoiding an implicit round trip through memory. The change should be a > win even if the MEMs or constants need legitimising for M2. Good suggestion. I worked out a v2 patch with the logic moved to in emit_move_insn. For (set (subreg:M1 (reg:M2 ...)) (constant C)) case, I am simply requiring C should also be a legitimate constant for M2. It works for the test case. I will do a full test for the v2 patch if you like it. > > [...] > > + if (inner != NULL_RTX > > + && aarch64_classify_vector_mode (inner_mode) == VEC_ADVSIMD > > + && GET_MODE_INNER (mode) == GET_MODE_INNER (inner_mode) > > + && known_eq (GET_MODE_SIZE (mode), GET_MODE_SIZE > (inner_mode)) > > + && GET_MODE_BITSIZE (inner_mode).to_constant () <= alignment) > > + { > > + rtx addr, mem; > > + if (MEM_P (src)) > > + { > > + addr = XEXP (src, 0); > > + mem = gen_rtx_MEM (inner_mode, addr); > > + emit_move_insn (inner, mem); > > + } > > + else > > + { > > + addr = XEXP (dest, 0); > > + mem = gen_rtx_MEM (inner_mode, addr); > > + emit_move_insn (mem, inner); > > gen_rtx_MEM shouldn't be used to create new versions of existing MEMs, > since it drops all the attributes. It's better to use something like > adjust_address instead. That will also take care of making sure that the > address is valid for the new mode. It was my fault. I should have realized that difference. :-) Felix
pr95254-v2.diff
Description: pr95254-v2.diff