10 Regression] AArch64 ld3 st4 less optimized

Jaydeep Chauhan Mon, 29 Apr 2019 06:59:46 -0700

Hello Ramana,

Description for bug-89057:
----------------------------------
This issue is occurs specifically with vcombine* method with optimization.
Assembly generated by vcombine* is less optimize. But individual
vcombine* method
can combining instruction properly and giving optimized output as per
below case.



Testcase 01:
-----------
int16x8_t
foo ()
{
  return vcombine_s16 (vdup_n_s16 (0), vdup_n_s16 (8));
}

Output with trunk ("-O2"):(Same for with patch)
-------------------------
foo:
.LFB3987:
        .cfi_startproc
        adrp    x0, .LC0
        ldr     q0, [x0, #:lo12:.LC0]
        ret
        .cfi_endproc
.LFE3987:
        .size   foo, .-foo
        .section        .rodata.cst16,"aM",@progbits,16
        .align  4
.LC0:
        .hword  0
        .hword  0
        .hword  0
        .hword  0
        .hword  8
        .hword  8
        .hword  8
        .hword  8
--------------------------------------------------------------------------------------------

But if vcombine call throught intrinsic function vst4_u8 combining is
not perform properly.
testcase is given below.

Testcase 02:
-----------

#include <stdint.h>
#include <arm_neon.h>

void qt_convert_rgb888_to_rgb32_neon(unsigned *dst, const unsigned
char *src, int len)
{
    const unsigned *const end = dst + len;

        const unsigned *const simdEnd = end - 7;
        // non-inline asm version (uses more moves)
        uint8x8x4_t dstVector;
        dstVector.val[3] = vdup_n_u8(0xff);
        do {
            uint8x8x3_t srcVector = vld3_u8(src);
            src += 3 * 8;
            dstVector.val[0] = srcVector.val[2];
            dstVector.val[1] = srcVector.val[1];
            dstVector.val[2] = srcVector.val[0];
///Internally vst4_u8 is calling vcombine
            vst4_u8((uint8_t*)dst, dstVector);
            dst += 8;
        } while (dst < simdEnd);

}

Output with trunk ("-O2"):
-------------------------
.L2:
        ld3     {v4.8b - v6.8b}, [x1]
        adrp    x3, .LC1
        add     x1, x1, 24
        ldr     q3, [x3, #:lo12:.LC1]
        mov     v16.8b, v6.8b
        mov     v7.8b, v5.8b
        mov     v4.8b, v4.8b
        ins     v16.d[1], v17.d[0]
        ins     v7.d[1], v17.d[0]
        ins     v4.d[1], v17.d[0]
        mov     v0.16b, v16.16b
        mov     v1.16b, v7.16b
        mov     v2.16b, v4.16b
        st4     {v0.8b - v3.8b}, [x0]
        add     x0, x0, 32
        cmp     x2, x0
        bhi     .L2
        ret
        .cfi_endproc

Output with trunk ("-O2") with patch:
-------------------------------------
.L2:
        ld3     {v4.8b - v6.8b}, [x1]
        adrp    x3, .LC1
        add     x1, x1, 24
        ldr     q3, [x3, #:lo12:.LC1]
        mov     v0.8b, v6.8b
        mov     v1.8b, v5.8b
        mov     v2.8b, v4.8b
        st4     {v0.8b - v3.8b}, [x0]
        add     x0, x0, 32
        cmp     x2, x0
        bhi     .L2
        ret

For testcase 2 combine instruction is not done properly with
aarch64_split_simd_combine
method. Internally it is combine by complex way so that it is
generating more assembly
instruction in output. So to do combining instruction with optimal way
i have added
"aarch64_combine_internal<mode>"  created for this.

We can optimize output of both testcases with only
"aarch64_combine_internal<mode>" ,But when we
run without optimization it is create problem in split instruction. So
that i have
handle both cases with optimization ("aarch64_combine_internal<mode>")and
without optimizations (aarch64_split_simd_combine) separately.

Testing:
-------
I have tested both testcase by executing below command in build directory
make -k check-gcc RUNTESTFLAGS="aarch64.exp=pr* -v" &>test.txt
and verified output of test.txt and also gcc.log.
In gcc.log status is for both case is PASS.

Thanks,
Jaydeep.

On Mon, Apr 29, 2019 at 4:19 PM Umesh Kalappa <umesh.kalap...@gmail.com> wrote:
>
> >>Before getting started with reviewing the patch , the first question
> is whether you have a copyright assignment on file or does your
> employer have one on record with the FSF ?
>
> Ramana, We asked for copyright assignment form ,with details asked by
> copyright-cl...@fsf.org(craig)
> @ ass...@gnu.org.
>
> Waiting for the reply .
> ~Umesh
>
> On Mon, Apr 29, 2019 at 2:55 PM Ramana Radhakrishnan
> <ramana....@googlemail.com> wrote:
> >
> > On Mon, Apr 29, 2019 at 8:44 AM Jaydeep Chauhan
> > <jaydeepchauhan1...@gmail.com> wrote:
> > >
> > > Hi All,
> > >
> > > The attached patch (89057.patch) fixes the subjected issue.
> > > Please let me know your thoughts on the patch.
> >
> > Thanks for your patch.
> >
> > Before getting started with reviewing the patch , the first question
> > is whether you have a copyright assignment on file or does your
> > employer have one on record with the FSF ?
> >
> > Further could you elaborate more on what you have done to fix this by
> > providing some description other than "fix the subjected issue" and
> > why you have chosen the approach that you have. Finally one of the
> > things required for any patch to be considered is a full test run. Can
> > you tell us how you tested this patch ?
> >
> >
> > Ramana
> >
> >
> >
> > >
> > > Thanks,
> > > Jaydeep.

Re: [Patch]Bug 89057 - [8/9/10 Regression] AArch64 ld3 st4 less optimized

Reply via email to