https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366
Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Because the intrinsics are type-agnostic, vector subregs appear before register
allocation: the pseudo coming from the load has mode V2DI, the shift needs to
be done in mode V4SI, and the bitwise-or and the store are done in mode V2DI
again. The subreg in the bitwise-or appears to be handled inefficiently. I
didn't dig deeper into what happens during allocation.

FWIW, using generic vectors avoids introducing such mode mismatches, and indeed
the variant coded with generic vectors does not have the extra loads. For your
original code you'll have to convert between generic vectors and __m128i to use
the shuffle intrinsic. The last paragraphs of the "Vector Extensions" chapter
[1] suggest using a union for that purpose in C; in C++ reinterpreting via a
union is formally UB, so another approach could be used (probably simply
converting via assignment).

[1] https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html

#include <stdint.h>

typedef uint32_t u32v4 __attribute__((vector_size(16)));

void gcc_double_load_128(int8_t *__restrict out, const int8_t *__restrict input)
{
    const u32v4 *vin = (const u32v4 *)input;
    u32v4 *vout = (u32v4 *)out;
    for (unsigned i = 0; i < 1024; i += 16) {
        u32v4 in = *vin++;
        /* shift and or stay in a single vector type, so no subregs appear */
        *vout++ = in | (in >> 4);
    }
}

Above code on Compiler Explorer: https://godbolt.org/z/MKPvxb
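
For context, a minimal sketch of the kind of intrinsics code the first
paragraph describes; this is an assumed reconstruction for illustration, not
the reporter's exact code. The load, bitwise-or and store use __m128i (mode
V2DI), while the 32-bit shift intrinsic operates in mode V4SI:

#include <immintrin.h>
#include <stdint.h>

/* Assumed illustration only: load/or/store in V2DI (__m128i), shift in V4SI. */
void intrinsics_variant(int8_t *__restrict out, const int8_t *__restrict input)
{
    for (unsigned i = 0; i < 1024; i += 16) {
        __m128i in = _mm_loadu_si128((const __m128i *)(input + i)); /* V2DI load  */
        __m128i sh = _mm_srli_epi32(in, 4);                         /* V4SI shift */
        __m128i r  = _mm_or_si128(in, sh);                          /* V2DI or    */
        _mm_storeu_si128((__m128i *)(out + i), r);                  /* V2DI store */
    }
}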
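
And a minimal sketch of the union-based conversion suggested above, assuming
the shuffle in question is _mm_shuffle_epi8 (SSSE3); the helper name and the
choice of shuffle intrinsic are assumptions for illustration only:

#include <immintrin.h>
#include <stdint.h>

typedef uint32_t u32v4 __attribute__((vector_size(16)));

/* Type punning through a union is well defined in C (see [1]);
   in C++ it is formally UB, so convert another way there. */
union v128 {
    u32v4   v;
    __m128i i;
};

/* Hypothetical helper: apply a byte shuffle to a generic vector. */
static inline u32v4 shuffle_bytes(u32v4 x, __m128i mask)
{
    union v128 u = { .v = x };
    u.i = _mm_shuffle_epi8(u.i, mask);
    return u.v;
}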