https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117722
--- Comment #13 from Robin Dapp <rdapp at gcc dot gnu.org> ---

I don't fully understand yet :) So the full-register moves are undesirable, I agree. When accumulating with a widening op, though, they seem unavoidable. The only alternative would be to split out the extension and use a regular add, which would get us close to your second example. I don't see why we would prefer one over the other when the only difference is one vsetvl outside the loop; vsext and vmv1r should also be comparable in latency. Regarding vectorizing with misaligned loads, how does that change with a usad expander?
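To make the two accumulation strategies concrete, here is a minimal lane-wise model in C. It assumes 16-bit absolute differences being accumulated into 32-bit lanes. The function names, the lane width and the 64-lane cap on the temporary are hypothetical and only for illustration; this is not the code GCC emits. Strategy A models a widening add (vwadd.wv: wide accumulator plus sign-extended narrow operand in one op). Strategy B splits out the extension (vsext.vf2) and then does a regular same-width add (vadd.vv). `sad_u8` is a scalar reference for the uint8 sum of absolute differences that a usad expander would compute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VL 64 /* hypothetical lane cap for the temporary in strategy B */

/* Strategy A: widening accumulate, acc[i] += sext(d[i]) as one operation
   per lane (models vwadd.wv). */
static void accumulate_widening(int32_t *acc, const int16_t *d, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        acc[i] += (int32_t)d[i];
}

/* Strategy B: extend first into a temporary (models vsext.vf2), then do a
   regular same-width add (models vadd.vv). */
static void accumulate_extend_then_add(int32_t *acc, const int16_t *d, size_t vl)
{
    int32_t tmp[MAX_VL];
    assert(vl <= MAX_VL);
    for (size_t i = 0; i < vl; i++)
        tmp[i] = (int32_t)d[i];   /* sign extension, 16 -> 32 bits */
    for (size_t i = 0; i < vl; i++)
        acc[i] += tmp[i];         /* plain 32-bit add */
}

/* Scalar reference: sum of absolute differences of two uint8 arrays,
   computed in a wider type and added to acc. */
static uint32_t sad_u8(const uint8_t *a, const uint8_t *b, size_t n, uint32_t acc)
{
    for (size_t i = 0; i < n; i++) {
        int diff = (int)a[i] - (int)b[i]; /* widen before subtracting */
        acc += (uint32_t)(diff < 0 ? -diff : diff);
    }
    return acc;
}
```

Both strategies produce identical per-lane results, so, as argued above, the choice comes down to instruction count and latency rather than correctness.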