https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119373

--- Comment #5 from Robin Dapp <rdapp at gcc dot gnu.org> ---
> The analysis of SPEC2017's 510.parest_r shows that the topmost basic block
> is a tight loop (see attached reducer). Once vectorised, by unrolling and
> mutualising 4 instructions, AArch64 achieves a 22% reduction in dynamic
> instruction count (DIC) within the block. However, RISC-V still vectorises
> but misses the opportunity to further unroll.
> 
> The vectoriser dump for RISC-V shows the analysis fails for the natural mode
> RVVM1DF (and chooses RVVMF8QI instead) because it requires a "conversion not
> supported by target". It turns out this is caused by two missing standard
> named patterns: vec_unpacku_hi and vec_unpacku_lo.

Why do you consider RVVM1DF a "natural" mode and not RVVMF8QI?  As far as I can
see, we do vectorize at full vector size:

  vsetvli a5,a4,e64,m1,tu,ma 

(Tail undisturbed is unexpected as there is no masked operation in the loop
body but that looks like a separate issue).

Apart from that I don't see too many redundant instructions.  We deliberately
don't define vec_unpacku_hi and vec_unpacku_lo because we don't have directly
matching instructions and because we prefer widening/narrowing with the same
number of elements rather than the same vector size.
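
To make the distinction concrete, here is a scalar C sketch of the two
widening styles.  It is only a model: NELTS and the function names are made
up for illustration, and "lo" is assumed to mean the lower-indexed half.
The first two functions mimic unpack lo/hi (same vector size in and out, so
one narrow vector yields two wide ones); the third mimics a same-element-count
widening, roughly what a vzext.vf2 from a half-sized source does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical number of 64-bit elements in one "vector". */
#define NELTS 4

/* Same vector size: widen the lower-indexed half of 2*NELTS narrow
   elements into NELTS wide elements (zero extension). */
void unpacku_lo(const uint32_t in[2 * NELTS], uint64_t out[NELTS])
{
    for (size_t i = 0; i < NELTS; i++)
        out[i] = in[i];
}

/* Same vector size: widen the upper-indexed half. */
void unpacku_hi(const uint32_t in[2 * NELTS], uint64_t out[NELTS])
{
    for (size_t i = 0; i < NELTS; i++)
        out[i] = in[NELTS + i];
}

/* Same element count: NELTS narrow elements become NELTS wide
   elements in a single step; no hi/lo split is needed. */
void widen_same_count(const uint32_t in[NELTS], uint64_t out[NELTS])
{
    for (size_t i = 0; i < NELTS; i++)
        out[i] = in[i];
}
```

With the same-element-count form the narrow operand simply occupies a
fraction of a register, which is why no separate hi/lo step is required.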

I suppose much of the icount difference is due to aarch64's complex addressing
modes.  All of the loads here include an offset and a shift while we need to do
that explicitly.  If we had similar addressing modes our icount would surely be
reduced by >30%.

Regarding unrolling: We cannot/do not unroll those length-controlled VLA loops.
If we wanted unrolling we would need a VLS-like loop.  Could you detail what
aarch64 gains by unrolling, i.e. which instructions get elided?
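
For reference, a length-controlled loop has roughly the following shape,
shown here as a scalar C model.  VLMAX, set_vl and axpy are hypothetical
names for illustration, and set_vl is a simplification of what vsetvli
returns (the real instruction has some freedom when the remaining count is
between VLMAX and 2*VLMAX).  The point is that the number of elements
processed is recomputed every iteration, so an unrolled copy of the body
would need its own length computation.

```c
#include <stddef.h>

/* Hypothetical maximum number of elements per vector iteration. */
#define VLMAX 4

/* Simplified model of vsetvli: process at most VLMAX elements. */
static size_t set_vl(size_t remaining)
{
    return remaining < VLMAX ? remaining : VLMAX;
}

/* y[i] += a * x[i] for i in [0, n), in length-controlled chunks. */
void axpy(size_t n, double a, const double *x, double *y)
{
    while (n > 0) {
        size_t vl = set_vl(n);          /* length for this iteration */
        for (size_t i = 0; i < vl; i++) /* stands in for one vector op */
            y[i] += a * x[i];
        x += vl;
        y += vl;
        n -= vl;                        /* no scalar epilogue needed */
    }
}
```

A VLS-like loop would instead use a fixed chunk size plus an epilogue, which
is the form that lends itself to unrolling.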
