https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93720
--- Comment #2 from Dmitrij Pochepko <dpochepk at gmail dot com> ---
I have a patch which recognizes this pattern and emits ins instructions.
The example in this issue's description now compiles fine and produces this
assembly:

0000000000000000 <test_vpasted2>:
   0:   6e184420        mov     v0.d[1], v1.d[1]
   4:   d65f03c0        ret

However, there are slightly more complicated examples where the allocated
registers prevent optimal assembly from being generated.
(example: part of the blender benchmark from speccpu)

#include <math.h>

static inline float dot_v3v3(const float a[3], const float b[3])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

static inline float len_v3(const float a[3])
{
    return sqrtf(dot_v3v3(a, a));
}

void window_translate_m4(float winmat[4][4], float perspmat[4][4],
                         const float x, const float y)
{
    if (winmat[2][3] == -1.0f) {
        /* in the case of a win-matrix, this means perspective always */
        float v1[3];
        float v2[3];
        float len1, len2;

        v1[0] = perspmat[0][0];
        v1[1] = perspmat[1][0];
        v1[2] = perspmat[2][0];

        v2[0] = perspmat[0][1];
        v2[1] = perspmat[1][1];
        v2[2] = perspmat[2][1];

        len1 = (1.0f / len_v3(v1));
        len2 = (1.0f / len_v3(v2));

        winmat[2][0] += len1 * winmat[0][0] * x;
        winmat[2][1] += len2 * winmat[1][1] * y;
    }
    else {
        winmat[3][0] += x;
        winmat[3][1] += y;
    }
}

This will produce:
...
  24:   fd400010        ldr     d16, [x0]
  28:   fd400807        ldr     d7, [x0,#16]
...
  34:   6e040611        mov     v17.s[0], v16.s[0]
  38:   6e0c24f1        mov     v17.s[1], v7.s[1]
...
# d16/v16 and d7/v7 are not used anywhere else

while it could be:
...
  24:   fd400010        ldr     d16, [x0]
  28:   fd400807        ldr     d7, [x0,#16]
...
  38:   6e0c24f1        mov     v16.s[1], v7.s[1]
...
# and v16 is used instead of v17.

It looks like peephole2 can be used to optimize this. I'm currently looking
into it.
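
For readers without the issue description at hand, here is a minimal sketch of the kind of source that the single "mov v0.d[1], v1.d[1]" corresponds to: keep lane 0 of one two-lane vector and take lane 1 from another. This is an assumption about the shape of the test case, not a copy of it. The names v2df, paste_low_high and check_paste_low_high are hypothetical, and the GCC/Clang vector extension is used in place of arm_neon.h so that the sketch builds on any target.

```c
#include <assert.h>

/* Two-lane double vector via the GCC/Clang vector extension.
   Assumption: the issue's own test case may use arm_neon.h types instead. */
typedef double v2df __attribute__ ((vector_size (16)));

/* Hypothetical stand-in for test_vpasted2: keep lane 0 of LOW and
   insert lane 1 of HIGH, i.e. the effect of "ins v0.d[1], v1.d[1]". */
v2df
paste_low_high (v2df low, v2df high)
{
  v2df result = low;
  result[1] = high[1];
  return result;
}

/* Self-check of the lane semantics only; it says nothing about which
   instructions the compiler actually emits. */
void
check_paste_low_high (void)
{
  v2df low = { 1.0, 2.0 };
  v2df high = { 3.0, 4.0 };
  v2df r = paste_low_high (low, high);

  assert (r[0] == 1.0);   /* lane 0 comes from LOW */
  assert (r[1] == 4.0);   /* lane 1 comes from HIGH */
}
```

Whether a given compiler turns this into one ins, a tbl, or something else depends on the target and the GCC version, so the generated code should be checked with objdump rather than assumed.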