Hello Ramana, Description for bug-89057: ---------------------------------- This issue is occurs specifically with vcombine* method with optimization. Assembly generated by vcombine* is less optimize. But individual vcombine* method can combining instruction properly and giving optimized output as per below case.
Testcase 01: ----------- int16x8_t foo () { return vcombine_s16 (vdup_n_s16 (0), vdup_n_s16 (8)); } Output with trunk ("-O2"):(Same for with patch) ------------------------- foo: .LFB3987: .cfi_startproc adrp x0, .LC0 ldr q0, [x0, #:lo12:.LC0] ret .cfi_endproc .LFE3987: .size foo, .-foo .section .rodata.cst16,"aM",@progbits,16 .align 4 .LC0: .hword 0 .hword 0 .hword 0 .hword 0 .hword 8 .hword 8 .hword 8 .hword 8 -------------------------------------------------------------------------------------------- But if vcombine call throught intrinsic function vst4_u8 combining is not perform properly. testcase is given below. Testcase 02: ----------- #include <stdint.h> #include <arm_neon.h> void qt_convert_rgb888_to_rgb32_neon(unsigned *dst, const unsigned char *src, int len) { const unsigned *const end = dst + len; const unsigned *const simdEnd = end - 7; // non-inline asm version (uses more moves) uint8x8x4_t dstVector; dstVector.val[3] = vdup_n_u8(0xff); do { uint8x8x3_t srcVector = vld3_u8(src); src += 3 * 8; dstVector.val[0] = srcVector.val[2]; dstVector.val[1] = srcVector.val[1]; dstVector.val[2] = srcVector.val[0]; ///Internally vst4_u8 is calling vcombine vst4_u8((uint8_t*)dst, dstVector); dst += 8; } while (dst < simdEnd); } Output with trunk ("-O2"): ------------------------- .L2: ld3 {v4.8b - v6.8b}, [x1] adrp x3, .LC1 add x1, x1, 24 ldr q3, [x3, #:lo12:.LC1] mov v16.8b, v6.8b mov v7.8b, v5.8b mov v4.8b, v4.8b ins v16.d[1], v17.d[0] ins v7.d[1], v17.d[0] ins v4.d[1], v17.d[0] mov v0.16b, v16.16b mov v1.16b, v7.16b mov v2.16b, v4.16b st4 {v0.8b - v3.8b}, [x0] add x0, x0, 32 cmp x2, x0 bhi .L2 ret .cfi_endproc Output with trunk ("-O2") with patch: ------------------------------------- .L2: ld3 {v4.8b - v6.8b}, [x1] adrp x3, .LC1 add x1, x1, 24 ldr q3, [x3, #:lo12:.LC1] mov v0.8b, v6.8b mov v1.8b, v5.8b mov v2.8b, v4.8b st4 {v0.8b - v3.8b}, [x0] add x0, x0, 32 cmp x2, x0 bhi .L2 ret For testcase 2 combine instruction is not done properly with aarch64_split_simd_combine method. Internally it is combine by complex way so that it is generating more assembly instruction in output. So to do combining instruction with optimal way i have added "aarch64_combine_internal<mode>" created for this. We can optimize output of both testcases with only "aarch64_combine_internal<mode>" ,But when we run without optimization it is create problem in split instruction. So that i have handle both cases with optimization ("aarch64_combine_internal<mode>")and without optimizations (aarch64_split_simd_combine) separately. Testing: ------- I have tested both testcase by executing below command in build directory make -k check-gcc RUNTESTFLAGS="aarch64.exp=pr* -v" &>test.txt and verified output of test.txt and also gcc.log. In gcc.log status is for both case is PASS. Thanks, Jaydeep. On Mon, Apr 29, 2019 at 4:19 PM Umesh Kalappa <umesh.kalap...@gmail.com> wrote: > > >>Before getting started with reviewing the patch , the first question > is whether you have a copyright assignment on file or does your > employer have one on record with the FSF ? > > Ramana, We asked for copyright assignment form ,with details asked by > copyright-cl...@fsf.org(craig) > @ ass...@gnu.org. > > Waiting for the reply . > ~Umesh > > On Mon, Apr 29, 2019 at 2:55 PM Ramana Radhakrishnan > <ramana....@googlemail.com> wrote: > > > > On Mon, Apr 29, 2019 at 8:44 AM Jaydeep Chauhan > > <jaydeepchauhan1...@gmail.com> wrote: > > > > > > Hi All, > > > > > > The attached patch (89057.patch) fixes the subjected issue. > > > Please let me know your thoughts on the patch. > > > > Thanks for your patch. > > > > Before getting started with reviewing the patch , the first question > > is whether you have a copyright assignment on file or does your > > employer have one on record with the FSF ? > > > > Further could you elaborate more on what you have done to fix this by > > providing some description other than "fix the subjected issue" and > > why you have chosen the approach that you have. Finally one of the > > things required for any patch to be considered is a full test run. Can > > you tell us how you tested this patch ? > > > > > > Ramana > > > > > > > > > > > > Thanks, > > > Jaydeep.