I have been working on tuning vector transpose within groups of vregs.

The canonical approach is to make multiple passes across pairs of rows,
zipping row pairs first at the API element width, then at double SEW,
and doubling the effective SEW again at each subsequent pass until the
width reaches VLEN/2 on the final pass.
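For concreteness, here is a minimal sketch of the first two passes on a
4x4 block of 32-bit elements, written with GNU C generic vectors and
__builtin_shufflevector (the types and names are illustrative, not the
code under discussion):

#include <stdint.h>

typedef uint32_t v4u32 __attribute__ ((vector_size (16)));
typedef uint64_t v2u64 __attribute__ ((vector_size (16)));

static void
transpose_4x4 (v4u32 r[4])
{
  /* Pass 1: zip pairs of rows at the element width (SEW = 32).  */
  v4u32 t0 = __builtin_shufflevector (r[0], r[1], 0, 4, 1, 5);
  v4u32 t1 = __builtin_shufflevector (r[0], r[1], 2, 6, 3, 7);
  v4u32 t2 = __builtin_shufflevector (r[2], r[3], 0, 4, 1, 5);
  v4u32 t3 = __builtin_shufflevector (r[2], r[3], 2, 6, 3, 7);

  /* Pass 2: zip again at double SEW, i.e. on 64-bit lane pairs.  */
  v2u64 a = (v2u64) t0, b = (v2u64) t1, c = (v2u64) t2, d = (v2u64) t3;
  r[0] = (v4u32) __builtin_shufflevector (a, c, 0, 2);
  r[1] = (v4u32) __builtin_shufflevector (a, c, 1, 3);
  r[2] = (v4u32) __builtin_shufflevector (b, d, 0, 2);
  r[3] = (v4u32) __builtin_shufflevector (b, d, 1, 3);
}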

That's all good for the first few passes, as long as SEW <= ELEN.
However, once SEW > ELEN, the middle-end emits BIT_FIELD_REFs, and that
results in stack spills of vregs followed by scalar loads. LLVM does
much better and emits vslide{up,down}.

There are two levels of dysfunction here:

1. Why spill & fill through the stack? Why not extract scalars directly 
from vregs
    directly into scalar regs?
2. Why involve scalar registers at all? Why not vslide or even vrgather, 
using
    temporary vregs as necessary?
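For the second point, in the simple case the >ELEN "element" is just a
pair of SEW=64 lanes, and a single vslidedown moves it to the bottom of
a vreg with no scalar regs or stack slots involved. A rough sketch in
terms of the <riscv_vector.h> intrinsics (the helper and index are
purely illustrative):

#include <riscv_vector.h>
#include <stddef.h>

/* Move the 128-bit chunk starting at lane 2*i down to lane 0,
   entirely within vregs.  */
static vuint64m1_t
extract_pair (vuint64m1_t src, size_t i, size_t vl)
{
  return __riscv_vslidedown_vx_u64m1 (src, 2 * i, vl);
}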

The fatal deficiency seems to be that the backend lacks vec_extractNM
patterns for an extracted mode M wider than ELEN. Here are some ideas:

1. Define scalar modes M larger than DImode. Aarch64 defines TI, OI, and
    XI modes for 128-, 256-, and 512-bit integers (all of which are wider
    than the hardware's scalar registers).
2. Define vector modes M that are half, quarter, eighth, ... the width of
    vector mode N. That can be done with mode iterators. We already have
    VLS_HALF and VLS_QUARTER, but there are no such iterators for the VLA
    modes. Note: there are no fractional LMUL modes defined for SEW=64,
    i.e., no RVVMF[248]DI.
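At the user level, idea 2 amounts to pulling a half-width subvector
straight out of a vreg, i.e. a vec_extract pattern whose extracted mode
is itself a vector. A sketch with fixed-length generic vectors (the
types and the hoped-for expansion are illustrative; the VLA case would
need the new modes and iterators):

#include <stdint.h>

typedef uint64_t v4u64 __attribute__ ((vector_size (32)));
typedef uint64_t v2u64 __attribute__ ((vector_size (16)));

/* Extract the upper 128-bit half of a 256-bit SEW=64 vector.  With a
   half-width vector mode and a matching vec_extract pattern this could
   expand to a single vslidedown instead of a spill/fill.  */
static v2u64
upper_half (v4u64 v)
{
  return __builtin_shufflevector (v, v, 2, 3);
}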

Comments? Better ideas?
G
