https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74585
--- Comment #10 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
The dse pass is responsible for removing all the unnecessary stack activity. I think that we are probably confusing it because the stores are full vector stores, but the loads are vector element loads of smaller size.

Some evidence for this: I can get the desired code generation by rewriting the code to copy all the vectors in the structure into "scalar vectors" prior to use, and doing the reverse to construct the result vector. To wit:

typedef struct
{
  __vector double vx0;
  __vector double vx1;
  __vector double vx2;
  __vector double vx3;
} vdoublex8_t;

vdoublex8_t
test_vecd8_rotate_left (vdoublex8_t a)
{
  __vector double avx0, avx1, avx2, avx3, rvx0, rvx1, rvx2, rvx3;
  __vector double temp;
  vdoublex8_t result;

  avx0 = a.vx0;
  avx1 = a.vx1;
  avx2 = a.vx2;
  avx3 = a.vx3;
  temp = a.vx0;

  /* Copy low dword of vx0 and high dword of vx1 to vx0 high / low.  */
  rvx0[VEC_DW_H] = avx0[VEC_DW_L];
  rvx0[VEC_DW_L] = avx1[VEC_DW_H];
  /* Copy low dword of vx1 and high dword of vx2 to vx1 high / low.  */
  rvx1[VEC_DW_H] = avx1[VEC_DW_L];
  rvx1[VEC_DW_L] = avx2[VEC_DW_H];
  /* Copy low dword of vx2 and high dword of vx3 to vx2 high / low.  */
  rvx2[VEC_DW_H] = avx2[VEC_DW_L];
  rvx2[VEC_DW_L] = avx3[VEC_DW_H];
  /* Copy low dword of vx3 and high dword of vx0 to vx3 high / low.  */
  rvx3[VEC_DW_H] = avx3[VEC_DW_L];
  rvx3[VEC_DW_L] = temp[VEC_DW_H];

  result.vx0 = rvx0;
  result.vx1 = rvx1;
  result.vx2 = rvx2;
  result.vx3 = rvx3;

  return (result);
}

With this we generate pretty tight code with no loads or stores. (I just lost my network connection to the server I was testing on, so I can't post the generated code, but it looks good.)
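
For comparison, here is a minimal sketch of the two access patterns side by side. Several things in it are assumptions rather than anything taken from the report: the generic `vector_size` type stands in for `__vector double` so that it builds on targets other than POWER, the values chosen for `VEC_DW_H` and `VEC_DW_L` are illustrative only, and `rotate_left_direct` is only a guess at the shape of the original test case, which is not shown in this comment. Both functions start from a copy of the input so that no element is read uninitialized.

```c
/* Assumption: a generic GCC/Clang vector type replaces __vector double
   so that the sketch is not tied to the POWER target.  */
typedef double v2df __attribute__ ((vector_size (16)));

/* Assumption: these definitions are not shown in the report; element 0
   is treated as the high doubleword purely for illustration.  */
#define VEC_DW_H 0
#define VEC_DW_L 1

typedef struct
{
  v2df vx0, vx1, vx2, vx3;
} vdoublex8_t;

/* Hypothetical "direct" form: element accesses go straight to the
   struct members, so a full vector store of a member can be followed
   by a narrower element load from the same location.  */
vdoublex8_t
rotate_left_direct (vdoublex8_t a)
{
  vdoublex8_t r = a;

  r.vx0[VEC_DW_H] = a.vx0[VEC_DW_L];
  r.vx0[VEC_DW_L] = a.vx1[VEC_DW_H];
  r.vx1[VEC_DW_H] = a.vx1[VEC_DW_L];
  r.vx1[VEC_DW_L] = a.vx2[VEC_DW_H];
  r.vx2[VEC_DW_H] = a.vx2[VEC_DW_L];
  r.vx2[VEC_DW_L] = a.vx3[VEC_DW_H];
  r.vx3[VEC_DW_H] = a.vx3[VEC_DW_L];
  r.vx3[VEC_DW_L] = a.vx0[VEC_DW_H];
  return r;
}

/* Workaround form: copy whole vectors into locals first, do the
   element moves on the locals, then store whole vectors back.  */
vdoublex8_t
rotate_left_via_locals (vdoublex8_t a)
{
  v2df a0 = a.vx0, a1 = a.vx1, a2 = a.vx2, a3 = a.vx3;
  v2df r0 = a0, r1 = a1, r2 = a2, r3 = a3;
  vdoublex8_t r;

  r0[VEC_DW_H] = a0[VEC_DW_L];
  r0[VEC_DW_L] = a1[VEC_DW_H];
  r1[VEC_DW_H] = a1[VEC_DW_L];
  r1[VEC_DW_L] = a2[VEC_DW_H];
  r2[VEC_DW_H] = a2[VEC_DW_L];
  r2[VEC_DW_L] = a3[VEC_DW_H];
  r3[VEC_DW_H] = a3[VEC_DW_L];
  r3[VEC_DW_L] = a0[VEC_DW_H];

  r.vx0 = r0;
  r.vx1 = r1;
  r.vx2 = r2;
  r.vx3 = r3;
  return r;
}
```

The two functions compute the same result; the only difference is where the element accesses land, which is what I suspect matters to dse. Comparing the assembly for the two at -O2 on the target of interest would be one way to check that.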