https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90424
Bug ID: 90424
Summary: memcpy into vector builtin not optimized
Product: gcc
Version: 9.1.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: kretz at kde dot org
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

Testcase (cf. https://godbolt.org/z/LsKcii):

template <class T> using V [[gnu::vector_size(16)]] = T;

template <class T, unsigned M = sizeof(V<T>)>
V<T> load(const void *p) {
  using W = V<T>;
  W r;
  __builtin_memcpy(&r, p, M);
  return r;
}

// movq or movsd
template V<char> load<char, 8>(const void *);     // bad
template V<short> load<short, 8>(const void *);   // bad
template V<int> load<int, 8>(const void *);       // bad
template V<long> load<long, 8>(const void *);     // good
template V<float> load<float, 8>(const void *);   // bad
template V<double> load<double, 8>(const void *); // good (movsd?)

// movd or movss
template V<char> load<char, 4>(const void *);     // bad
template V<short> load<short, 4>(const void *);   // bad
template V<int> load<int, 4>(const void *);       // good
template V<float> load<float, 4>(const void *);   // good

All of these partial loads should be translated to a single mov[qd] or movs[sd] instruction. But most of them are not.
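
For comparison, here is a minimal sketch of a possible source-level workaround. It is not part of the original testcase. It assumes that routing the partial load through a scalar whose size equals M, placed in lane 0 of a same-sized integer vector, steers the compiler toward the element types marked "good" above. The name load_workaround is hypothetical, and whether this actually yields a single mov[qd] has to be checked against the compiler version in question. Unlike the testcase, the sketch zeroes the upper bytes rather than leaving them indeterminate.

```cpp
#include <cstring>
#include <type_traits>

// Same alias as in the testcase (GNU vector extension, 16-byte vectors).
template <class T> using V [[gnu::vector_size(16)]] = T;

// Hypothetical helper, not from the report: load M bytes (4 or 8) from p
// into the low bytes of a V<T>, going through an integer scalar of size M.
template <class T, unsigned M>
V<T> load_workaround(const void *p) {
  static_assert(M == 4 || M == 8, "only 4- and 8-byte partial loads are sketched");
  // Pick an integer type whose size matches the partial load.
  using U = typename std::conditional<M == 8, long long, int>::type;
  static_assert(sizeof(U) == M, "scalar size must match the load size");

  U scalar;
  std::memcpy(&scalar, p, M);       // scalar load of exactly M bytes

  V<U> wide = {};                   // all lanes zero
  wide[0] = scalar;                 // loaded bytes go into lane 0

  V<T> r;
  std::memcpy(&r, &wide, sizeof r); // reinterpret the 16 bytes as V<T>
  return r;
}
```

For example, load_workaround<char, 8>(p) would stand in for load<char, 8>(p), with bytes 8 to 15 of the result being zero.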