https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93720

--- Comment #2 from Dmitrij Pochepko <dpochepk at gmail dot com> ---
I have a patch that recognizes this pattern and emits ins instructions. The
example in this issue's description now compiles as expected and produces this
assembly:

0000000000000000 <test_vpasted2>:
   0:   6e184420        mov     v0.d[1], v1.d[1]
   4:   d65f03c0        ret
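
For readers without the issue description at hand, a lane paste of this kind
might look like the sketch below. This is an assumed shape, not the test case
from the issue: the names f64x2 and paste_lanes are made up for illustration,
and the code relies on the GCC/Clang vector extensions.

```c
/* Hypothetical illustration only; not the test case from the issue.
   Requires GCC or Clang (vector_size attribute and vector subscripting). */
typedef double f64x2 __attribute__ ((vector_size (16)));

/* Build a vector from lane 0 of `low` and lane 1 of `high`.
   On AArch64 the arguments arrive in v0 and v1 and the result is
   returned in v0, so one lane insert into v0 is enough.  */
f64x2
paste_lanes (f64x2 low, f64x2 high)
{
  return (f64x2) { low[0], high[1] };
}
```

Since lane 0 of the result already sits in the return register, only lane 1
has to be moved, which matches the single mov shown above.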

However, there are slightly more complicated examples where the allocated
registers prevent optimal assembly from being generated
(example: part of the blender benchmark from SPEC CPU):
#include <math.h>

static inline float dot_v3v3(const float a[3], const float b[3])
{
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

static inline float len_v3(const float a[3])
{
        return sqrtf(dot_v3v3(a, a));
}


void window_translate_m4(float winmat[4][4], float perspmat[4][4],
                         const float x, const float y)
{
        if (winmat[2][3] == -1.0f) {
                /* in the case of a win-matrix, this means perspective always */
                float v1[3];
                float v2[3];
                float len1, len2;

                v1[0] = perspmat[0][0];
                v1[1] = perspmat[1][0];
                v1[2] = perspmat[2][0];

                v2[0] = perspmat[0][1];
                v2[1] = perspmat[1][1];
                v2[2] = perspmat[2][1];

                len1 = (1.0f / len_v3(v1));
                len2 = (1.0f / len_v3(v2));

                winmat[2][0] += len1 * winmat[0][0] * x;
                winmat[2][1] += len2 * winmat[1][1] * y;
        }
        else {
                winmat[3][0] += x;
                winmat[3][1] += y;
        }
}


This will produce:
...
  24:   fd400010        ldr     d16, [x0]
  28:   fd400807        ldr     d7, [x0,#16]
...
  34:   6e040611        mov     v17.s[0], v16.s[0]
  38:   6e0c24f1        mov     v17.s[1], v7.s[1]
...
# v16/d16 and v7/d7 are not used anywhere else

whereas it could be:

...
  24:   fd400010        ldr     d16, [x0]
  28:   fd400807        ldr     d7, [x0,#16]
...
  38:   6e0c24f0        mov     v16.s[1], v7.s[1]
...
# and v16 is used instead of v17.
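
Why the shorter sequence is equivalent can be modelled in plain C. This is a
toy model under stated assumptions, not compiler code: the vreg type and the
functions ins_lane, current_sequence and proposed_sequence are hypothetical
names, and only lanes 0 and 1 are meaningful here.

```c
/* Toy model of a vector register viewed as four 32-bit float lanes.  */
typedef struct { float s[4]; } vreg;

/* Models "mov vd.s[di], vn.s[si]": copy one lane and leave the
   remaining lanes of vd untouched.  */
void
ins_lane (vreg *vd, int di, const vreg *vn, int si)
{
  vd->s[di] = vn->s[si];
}

/* Sequence currently emitted: assemble the result in a third register.  */
vreg
current_sequence (vreg v16, vreg v7)
{
  vreg v17 = { { 0 } };
  ins_lane (&v17, 0, &v16, 0);   /* mov v17.s[0], v16.s[0] */
  ins_lane (&v17, 1, &v7, 1);    /* mov v17.s[1], v7.s[1]  */
  return v17;
}

/* Proposed sequence: v16 already holds lane 0 and is dead afterwards,
   so insert lane 1 into it directly.  */
vreg
proposed_sequence (vreg v16, vreg v7)
{
  ins_lane (&v16, 1, &v7, 1);    /* mov v16.s[1], v7.s[1]  */
  return v16;
}
```

Both functions yield the same lanes 0 and 1. The rewrite is valid only because
v16 has no other uses, which is the condition noted in the first listing.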


It looks like peephole2 can be used to optimize this. I'm currently looking
into it.
