https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90424
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Matthias Kretz from comment #2)
> I can't read the SSA code with certainty, but bit-inserting sounds like what
> I want to have. Alternatively, the partial vector load could be implemented
> like this - and looks even worse (https://godbolt.org/z/nJuTn-):
> template <class T>
> using V [[gnu::vector_size(16)]] = T;
>
> template <class T, unsigned... I>
> V<T> load(const void *p) {
>   const T* q = static_cast<const T*>(p);
>   V<T> r = {q[I]...};
>   return r;
> }
>
> // movq or movsd
> template V<char  > load<char  , 0,1,2,3,4,5,6,7>(const void *);
> template V<short > load<short , 0,1,2,3>(const void *);
> template V<int   > load<int   , 0,1>(const void *);
> template V<long  > load<long  , 0>(const void *);
> template V<float > load<float , 0,1>(const void *);
> template V<double> load<double, 0>(const void *);
>
> // movd or movss
> template V<char > load<char , 0,1,2,3>(const void *);
> template V<short> load<short, 0,1>(const void *);
> template V<int  > load<int  , 0>(const void *);
> template V<float> load<float, 0>(const void *);

Those end up like

load<int, 0, 1> (const void * p)
{
  V r;
  int _1;
  int _2;

  <bb 2> [local count: 1073741824]:
  _1 = MEM[(const int *)p_3(D)];
  _2 = MEM[(const int *)p_3(D) + 4B];
  r_5 = {_1, _2};
  return r_5;

it's not immediately clear where to optimize this - the loads would need
to be merged and the constructor adjusted to one from vectors. The
bswap pass looks like a good candidate for this.

Split out to PR90460,
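
To illustrate what "merging the loads" would mean at the source level, here is a minimal sketch for the load<int, 0, 1> case. The names v4si, load_elementwise, load_merged and check_equivalent are hypothetical and not taken from the PR. The sketch only shows that two adjacent 4-byte loads feeding a vector constructor give the same value as one 8-byte load into the low half of a zeroed vector. It makes no claim about which instructions GCC emits for either form.

```cpp
#include <cassert>
#include <cstring>

// GNU vector extension: four ints in one 16-byte vector (hypothetical alias).
typedef int v4si __attribute__((vector_size(16)));

// Element-wise form, mirroring load<int, 0, 1>: two scalar loads feed a
// vector constructor; the unnamed lanes 2 and 3 are zero-initialized.
v4si load_elementwise(const void *p) {
  const int *q = static_cast<const int *>(p);
  v4si r = {q[0], q[1]};
  return r;
}

// Hand-merged form (an assumption about the desired result, not GCC's
// output): a single 8-byte copy into the low half of a zeroed vector.
v4si load_merged(const void *p) {
  v4si r = {0, 0, 0, 0};
  std::memcpy(&r, p, 2 * sizeof(int));
  return r;
}

// Asserts that both forms produce bit-identical vectors for the input p.
void check_equivalent(const void *p) {
  v4si a = load_elementwise(p);
  v4si b = load_merged(p);
  assert(std::memcmp(&a, &b, sizeof a) == 0);
}
```

Calling check_equivalent on any buffer holding at least two ints should pass, since GNU vector types lay out their elements in memory in array order.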