I have been working on tuning vector transpose within groups of vregs. The canonical approach is to make multiple passes over pairs of rows, zipping row pairs first at the API element width (SEW), then doubling the zip width on each subsequent pass until it reaches VLEN/2 on the final pass.
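To make the pass structure concrete, here is a minimal fixed-width sketch (the 4x4 shape, type, and function names are mine, purely for illustration) using GCC generic vectors and __builtin_shufflevector: the first pass zips row pairs at element width, the second at twice that width, which for a 128-bit vector of 32-bit elements is already VLEN/2.

typedef unsigned int v4si __attribute__ ((vector_size (16)));

/* Transpose a 4x4 block of 32-bit elements held in four vregs.  */
static void
transpose_4x4 (v4si r[4])
{
  /* Pass 1: zip row pairs at element width (32-bit lanes).  */
  v4si t0 = __builtin_shufflevector (r[0], r[1], 0, 4, 1, 5);
  v4si t1 = __builtin_shufflevector (r[0], r[1], 2, 6, 3, 7);
  v4si t2 = __builtin_shufflevector (r[2], r[3], 0, 4, 1, 5);
  v4si t3 = __builtin_shufflevector (r[2], r[3], 2, 6, 3, 7);

  /* Pass 2: zip at twice the width (64-bit chunks, expressed as pairs
     of 32-bit lanes).  */
  r[0] = __builtin_shufflevector (t0, t2, 0, 1, 4, 5);
  r[1] = __builtin_shufflevector (t0, t2, 2, 3, 6, 7);
  r[2] = __builtin_shufflevector (t1, t3, 0, 1, 4, 5);
  r[3] = __builtin_shufflevector (t1, t3, 2, 3, 6, 7);
}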
That's all good for the first few passes, as long as SEW <= ELEN. However, once SEW > ELEN, the middle-end emits BIT_FIELD_REF, and that results in stack spills of vregs followed by scalar loads. LLVM does much better and emits vslide{up,down}.

There are two levels of dysfunction here:

1. Why spill & fill through the stack? Why not extract scalars from vregs directly into scalar regs? (A rough sketch of what I mean is at the end of this mail.)

2. Why involve scalar registers at all? Why not vslide or even vrgather, using temporary vregs as necessary?

The fatal deficiency seems to be that the backend lacks vec_extractNM patterns for mode M wider than ELEN. Here are some ideas:

1. Define scalar modes M larger than DImode. AArch64 defines TI, OI, and XI modes for 128-, 256-, and 512-bit integers (all of which are wider than the hardware supports).

2. Define vector modes M that are half, quarter, eighth, ... the width of vector mode N. That can be done with mode iterators. We already have VLS_HALF and VLS_QUARTER, but there are no such iterators for the VLA modes. Note: there are no fractional LMUL modes defined for SEW=64, i.e., no RVVMF[248]DI.

Comments? Better ideas?

G
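P.S. A rough C-level sketch of what I mean in point 1 above (the fixed-width type and function name are just for illustration): with ELEN = 64, a 128-bit chunk of a vector of 64-bit elements could be moved through two ELEN-sized lane extracts rather than spilling the whole vreg to the stack.

typedef unsigned long long v4di __attribute__ ((vector_size (32)));

/* Extract the high 128-bit chunk of a 4 x 64-bit vector as two DImode
   lane extracts (constant indices map to vec_extract).  */
unsigned __int128
extract_high_chunk (v4di x)
{
  unsigned long long lo = x[2];
  unsigned long long hi = x[3];
  return ((unsigned __int128) hi << 64) | lo;
}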