[Bug rtl-optimization/74585] powerpc64: Very poor code generation for homogeneous vector aggregates passed in registers

wschmidt at gcc dot gnu.org Fri, 12 Aug 2016 14:19:44 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74585


--- Comment #10 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
The DSE (dead store elimination) pass is responsible for removing all the
unnecessary stack activity.  I think that we are probably confusing it because
the stores are full vector stores, but the loads are vector element loads of
smaller size.

Some evidence for this:  I can get the desired code generation by rewriting the
code to copy all the vectors in the structure into "scalar vectors" prior to
use, and doing the reverse to construct the result vector.  We then get the
code we're looking for.

To wit:

typedef struct
          {
                __vector double vx0;
                __vector double vx1;
                __vector double vx2;
                __vector double vx3;
          } vdoublex8_t;

vdoublex8_t
test_vecd8_rotate_left (vdoublex8_t a)
{
        __vector double avx0, avx1, avx2, avx3, rvx0, rvx1, rvx2, rvx3;
        __vector double temp;
        vdoublex8_t result;

        avx0 = a.vx0;
        avx1 = a.vx1;
        avx2 = a.vx2;
        avx3 = a.vx3;

        temp = a.vx0;

        /* Copy low dword of vx0 and high dword of vx1 to vx0 high / low.  */
        rvx0[VEC_DW_H] = avx0[VEC_DW_L];
        rvx0[VEC_DW_L] = avx1[VEC_DW_H];
        /* Copy low dword of vx1 and high dword of vx2 to vx1 high / low.  */
        rvx1[VEC_DW_H] = avx1[VEC_DW_L];
        rvx1[VEC_DW_L] = avx2[VEC_DW_H];
        /* Copy low dword of vx2 and high dword of vx3 to vx2 high / low.  */
        rvx2[VEC_DW_H] = avx2[VEC_DW_L];
        rvx2[VEC_DW_L] = avx3[VEC_DW_H];
        /* Copy low dword of vx3 and high dword of vx0 to vx3 high / low.  */
        rvx3[VEC_DW_H] = avx3[VEC_DW_L];
        rvx3[VEC_DW_L] = temp[VEC_DW_H];

        result.vx0 = rvx0;
        result.vx1 = rvx1;
        result.vx2 = rvx2;
        result.vx3 = rvx3;

        return (result);
}

With this we generate pretty tight code with no loads or stores.  (Just lost my
network connection to the server I was testing on, so I can't post the code,
but it looks good.)
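
For anyone who wants to compare the two access patterns side by side, here is a
minimal self-contained sketch.  Several things in it are assumptions and not
taken from the test case above: the generic vector_size type stands in for
__vector double so that it builds on non-PowerPC targets, the VEC_DW_H and
VEC_DW_L values are guesses (the real definitions are not shown in this comment
and presumably depend on endianness), and rotate_left_direct is a hypothetical
reconstruction of the "element access straight from the aggregate" form, not
the original source.  The helper names flatten and rotate_variants_agree are
likewise made up for illustration.

```c
/* Stand-in for `__vector double` using the GCC/Clang generic vector
   extension, so the sketch is not tied to PowerPC (assumption).  */
typedef double v2df __attribute__ ((vector_size (16)));

/* Assumed element indices; the real definitions are not shown here.  */
#define VEC_DW_H 0
#define VEC_DW_L 1

typedef struct
{
  v2df vx0;
  v2df vx1;
  v2df vx2;
  v2df vx3;
} vdoublex8_t;

/* Hypothetical "direct" form: element loads straight from the aggregate
   parameter, element stores straight into the aggregate result.  */
vdoublex8_t
rotate_left_direct (vdoublex8_t a)
{
  vdoublex8_t result = { { 0.0, 0.0 }, { 0.0, 0.0 },
                         { 0.0, 0.0 }, { 0.0, 0.0 } };

  result.vx0[VEC_DW_H] = a.vx0[VEC_DW_L];
  result.vx0[VEC_DW_L] = a.vx1[VEC_DW_H];
  result.vx1[VEC_DW_H] = a.vx1[VEC_DW_L];
  result.vx1[VEC_DW_L] = a.vx2[VEC_DW_H];
  result.vx2[VEC_DW_H] = a.vx2[VEC_DW_L];
  result.vx2[VEC_DW_L] = a.vx3[VEC_DW_H];
  result.vx3[VEC_DW_H] = a.vx3[VEC_DW_L];
  result.vx3[VEC_DW_L] = a.vx0[VEC_DW_H];
  return result;
}

/* Rewritten form: copy each member into a local vector before use and
   assemble the result from local vectors, as described above.  */
vdoublex8_t
rotate_left_copied (vdoublex8_t a)
{
  v2df avx0 = a.vx0, avx1 = a.vx1, avx2 = a.vx2, avx3 = a.vx3;
  v2df rvx0 = { 0.0, 0.0 }, rvx1 = { 0.0, 0.0 };
  v2df rvx2 = { 0.0, 0.0 }, rvx3 = { 0.0, 0.0 };
  vdoublex8_t result;

  rvx0[VEC_DW_H] = avx0[VEC_DW_L];
  rvx0[VEC_DW_L] = avx1[VEC_DW_H];
  rvx1[VEC_DW_H] = avx1[VEC_DW_L];
  rvx1[VEC_DW_L] = avx2[VEC_DW_H];
  rvx2[VEC_DW_H] = avx2[VEC_DW_L];
  rvx2[VEC_DW_L] = avx3[VEC_DW_H];
  rvx3[VEC_DW_H] = avx3[VEC_DW_L];
  rvx3[VEC_DW_L] = avx0[VEC_DW_H];

  result.vx0 = rvx0;
  result.vx1 = rvx1;
  result.vx2 = rvx2;
  result.vx3 = rvx3;
  return result;
}

/* Write the eight doubles out in high / low order, vx0 first.  */
static void
flatten (vdoublex8_t v, double out[8])
{
  out[0] = v.vx0[VEC_DW_H];  out[1] = v.vx0[VEC_DW_L];
  out[2] = v.vx1[VEC_DW_H];  out[3] = v.vx1[VEC_DW_L];
  out[4] = v.vx2[VEC_DW_H];  out[5] = v.vx2[VEC_DW_L];
  out[6] = v.vx3[VEC_DW_H];  out[7] = v.vx3[VEC_DW_L];
}

/* Return 1 when both forms agree and rotate the eight doubles left by
   one position.  The input literal matches the assumed indices above.  */
int
rotate_variants_agree (void)
{
  static const double expected[8]
    = { 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 0.0 };
  vdoublex8_t in = { { 0.0, 1.0 }, { 2.0, 3.0 },
                     { 4.0, 5.0 }, { 6.0, 7.0 } };
  double d[8], c[8];
  int i;

  flatten (rotate_left_direct (in), d);
  flatten (rotate_left_copied (in), c);
  for (i = 0; i < 8; i++)
    if (d[i] != c[i] || c[i] != expected[i])
      return 0;
  return 1;
}
```

The check only confirms that the two forms compute the same values.  Comparing
the -O2 -S output of the two functions on a powerpc64 target is one way to see
whether the stack stores and loads differ between them; the sketch itself makes
no claim about what the compiler emits.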
