There are two levels of dysfunction here:

1. Why spill & fill through the stack? Why not extract scalars from vregs
    directly into scalar regs?
2. Why involve scalar registers at all? Why not vslide or even vrgather, using
    temporary vregs as necessary?

That's how expmed does it. If vec_extract and friends or subregs don't work, we need to go via memory as a last resort.

The fatal deficiency seems to be that the backend lacks vec_extractNM patterns
for mode M bigger than ELEN. Here are some ideas:

1. Define scalar modes M larger than DI mode. Aarch64 defines TI, OI, and XI
    modes for 128, 256, and 512-bit integers (all of which are wider than the
    hardware supports).
2. Define vector modes M that are half, quarter, eighth, ... width of vector
    mode N. That can be done with mode iterators. We already have VLS_HALF and
    VLS_QUARTER, but there are no such iterators for the VLA modes.
    Note: there are no fractional LMUL modes defined for SEW=64, i.e., no
    RVVMF[248]DI.

Yeah, vec_extract with vector modes is generally the way to go, I'd say. That's more of a "VLS" line of thinking, though.

We cannot have RVVMF2DI and smaller when the minimum vector length is 64 bits. Increasing the minimum vector length helps but then we're not fully "VLA" any more.
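
To make the arithmetic concrete, here is a rough sketch in plain C. It is a simplified model, not backend code: elems_per_group is a hypothetical helper, and it assumes the element count of a register group is VLEN * LMUL / SEW, ignoring the other constraints the V spec places on fractional LMUL.

```c
#include <assert.h>

/* Hypothetical helper: number of SEW-bit elements that fit in a register
   group for a given VLEN.  LMUL is passed as the fraction
   lmul_num / lmul_den, e.g. 1/2 for an MF2 mode.  */
static unsigned
elems_per_group (unsigned vlen, unsigned sew,
                 unsigned lmul_num, unsigned lmul_den)
{
  assert (sew != 0 && lmul_den != 0);
  return (vlen * lmul_num) / (sew * lmul_den);
}
```

With vlen = 64, sew = 64 and LMUL = 1/2 this gives zero elements, so a mode like RVVMF2DI would be empty at the minimum vector length. With vlen = 128 it gives one element, which is the "increase the minimum vector length" case.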

How does aarch64 do it? Do the larger scalar modes help for your problem? They have those trn instructions I guess but doesn't their approach involve BIT_FIELD_REFs?

What is your approach, i.e. what code are you writing? Do you start with C code or is this an autovec expansion? Couldn't you use vrgathers etc. right away?
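
To illustrate what I mean by using slides right away, here is a scalar model in plain C of extracting the upper half of a vector with a single slide-down. It only models the element movement: model_vslidedown and model_extract_high_half are made-up names, the arrays stand in for vector registers, and it is not meant as what the backend should emit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a slide-down: dst[i] = src[i + offset] as long as
   i + offset is below vlmax, otherwise 0.  Only the first vl elements
   of dst are written.  */
static void
model_vslidedown (uint64_t *dst, const uint64_t *src,
                  size_t vl, size_t vlmax, size_t offset)
{
  assert (vl <= vlmax);
  for (size_t i = 0; i < vl; i++)
    dst[i] = (i + offset < vlmax) ? src[i + offset] : 0;
}

/* Hypothetical helper: take the upper n/2 elements of an n-element
   vector by sliding down by n/2 with the result length set to n/2.  */
static void
model_extract_high_half (uint64_t *dst, const uint64_t *src, size_t n)
{
  model_vslidedown (dst, src, n / 2, n, n / 2);
}
```

For a source of {10, 11, 12, 13}, model_extract_high_half leaves {12, 13} in dst, with no scalar register involved in the data movement.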

--
Regards
Robin
