On 23 May 2016 at 14:59, Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> wrote: > On 5 February 2016 at 18:40, Prathamesh Kulkarni > <prathamesh.kulka...@linaro.org> wrote: >> On 4 February 2016 at 16:31, Ramana Radhakrishnan >> <ramana....@googlemail.com> wrote: >>> On Sun, Jan 17, 2016 at 9:06 AM, Prathamesh Kulkarni >>> <prathamesh.kulka...@linaro.org> wrote: >>>> On 31 July 2015 at 15:04, Ramana Radhakrishnan >>>> <ramana.radhakrish...@foss.arm.com> wrote: >>>>> >>>>> >>>>> On 29/07/15 11:09, Prathamesh Kulkarni wrote: >>>>>> Hi, >>>>>> This patch tries to implement division with multiplication by >>>>>> reciprocal using vrecpe/vrecps >>>>>> with -funsafe-math-optimizations and -freciprocal-math enabled. >>>>>> Tested on arm-none-linux-gnueabihf using qemu. >>>>>> OK for trunk ? >>>>>> >>>>>> Thank you, >>>>>> Prathamesh >>>>>> >>>>> >>>>> I've tried this in the past and never been convinced that 2 iterations >>>>> are enough to get to stability with this given that the results are only >>>>> precise for 8 bits / iteration. Thus I've always believed you need 3 >>>>> iterations rather than 2 at which point I've never been sure that it's >>>>> worth it. So the testing that you've done with this currently is not >>>>> enough for this to go into the tree. >>>>> >>>>> I'd like this to be tested on a couple of different AArch32 >>>>> implementations with a wider range of inputs to verify that the results >>>>> are acceptable as well as running something like SPEC2k(6) with atleast >>>>> one iteration to ensure correctness. >>>> Hi, >>>> I got results of SPEC2k6 fp benchmarks: >>>> a15: +0.64% overall, 481.wrf: +6.46% >>>> a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76% >>>> a57: +0.35% overall, 481.wrf: +3.84% >>>> The other benchmarks had (almost) identical results. >>> >>> Thanks for the benchmarking results - Please repost the patch with >>> the changes that I had requested in my previous review - given it is >>> now stage4 , I would rather queue changes like this for stage1 now. >> Hi, >> Please find the updated patch attached. >> It passes testsuite for arm-none-linux-gnueabi, arm-none-linux-gnueabihf and >> arm-none-eabi. >> However the test-case added in the patch (neon-vect-div-1.c) fails to >> get vectorized at -O2 >> for armeb-none-linux-gnueabihf. >> Charles suggested me to try with -O3, which worked. >> It appears the test-case fails to get vectorized with >> -fvect-cost-model=cheap (which is default enabled at -O2) >> and passes for -fno-vect-cost-model / -fvect-cost-model=dynamic >> >> I can't figure out why it fails -fvect-cost-model=cheap. >> From the vect dump (attached): >> neon-vect-div-1.c:12:3: note: Setting misalignment to -1. >> neon-vect-div-1.c:12:3: note: not vectorized: unsupported unaligned load.*_9 > Hi, > I think I have some idea why the test-case fails attached with patch > fail to get vectorized on armeb with -O2. > > Issue with big endian vectorizer: > The patch does not cause regressions on big endian vectorizer but > fails to vectorize the test-cases attached with the patch, while they > get vectorized on > litttle-endian. > Fails with armeb with the following message in dump: > note: not vectorized: unsupported unaligned load.*_9 > > The behavior of big and little endian vectorizer seems to be different > in arm_builtin_support_vector_misalignment() which overrides the hook > targetm.vectorize.support_vector_misalignment(). > > targetm.vectorize.support_vector_misalignment is called by > vect_supportable_dr_alignment () which in turn is called > by verify_data_refs_alignment (). > > Execution upto following condition is common between arm and armeb > in vect_supportable_dr_alignment(): > > if ((TYPE_USER_ALIGN (type) && !is_packed) > || targetm.vectorize.support_vector_misalignment (mode, type, > DR_MISALIGNMENT (dr), is_packed)) > /* Can't software pipeline the loads, but can at least do them. */ > return dr_unaligned_supported; > > For little endian case: > arm_builtin_support_vector_misalignment() is called with > V2SF mode and misalignment == -1, and the following condition > becomes true: > /* If the misalignment is unknown, we should be able to handle the access > so long as it is not to a member of a packed data structure. */ > if (misalignment == -1) > return true; > > Since the hook returned true we enter the condition above in > vect_supportable_dr_alignment() and return dr_unaligned_supported; > > For big-endian: > arm_builtin_support_vector_misalignment() is called with V2SF mode. > The following condition that gates the entire function body fails: > if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access) > and the default hook gets called with V2SF mode and the default hook > returns false because > movmisalign_optab does not exist for V2SF mode. > > So the condition above in vect_supportable_dr_alignment() fails > and we come here: > /* Unsupported. */ > return dr_unaligned_unsupported; > > And hence we get the unaligned load not supported message in the dump > for armeb in verify_data_ref_alignment (): > > static bool > verify_data_ref_alignment (data_reference_p dr) > { > enum dr_alignment_support supportable_dr_alignment > = vect_supportable_dr_alignment (dr, false); > if (!supportable_dr_alignment) > { > if (dump_enabled_p ()) > { > if (DR_IS_READ (dr)) > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > "not vectorized: unsupported unaligned load."); > > With -O3, the test-cases vectorize for armeb, because loop peeling for > alignment > is turned on. > The above behavior is also reproducible with test-case which is > irrelevant to the patch. > for instance, we get the same unsupported unaligned load for following > test-case (replaced / with +) > > void > foo (int len, float * __restrict p, float *__restrict x) > { > len = len & ~31; > for (int i = 0; i < len; i++) > p[i] = p[i] + x[i]; > } > Is the patch OK to commit after bootstrap+test ? ping https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html
Thanks, Prathamesh > > Thanks, > Prathamesh >> >> Thanks, >> Prathamesh >>> >>> Thanks, >>> Ramana >>> >>>> >>>> Thanks, >>>> Prathamesh >>>>> >>>>> >>>>> moving on to the patches. >>>>> >>>>>> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md >>>>>> index 654d9d5..28c2e2a 100644 >>>>>> --- a/gcc/config/arm/neon.md >>>>>> +++ b/gcc/config/arm/neon.md >>>>>> @@ -548,6 +548,32 @@ >>>>>> (const_string "neon_mul_<V_elem_ch><q>")))] >>>>>> ) >>>>>> >>>>> >>>>> Please add a comment here. >>>>> >>>>>> +(define_expand "div<mode>3" >>>>>> + [(set (match_operand:VCVTF 0 "s_register_operand" "=w") >>>>>> + (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w") >>>>>> + (match_operand:VCVTF 2 "s_register_operand" "w")))] >>>>> >>>>> I want to double check that this doesn't collide with Alan's patches for >>>>> FP16 especially if he reuses the VCVTF iterator for all the vcvt f16 >>>>> cases. >>>>> >>>>>> + "TARGET_NEON && flag_unsafe_math_optimizations && >>>>>> flag_reciprocal_math" >>>>>> + { >>>>>> + rtx rec = gen_reg_rtx (<MODE>mode); >>>>>> + rtx vrecps_temp = gen_reg_rtx (<MODE>mode); >>>>>> + >>>>>> + /* Reciprocal estimate */ >>>>>> + emit_insn (gen_neon_vrecpe<mode> (rec, operands[2])); >>>>>> + >>>>>> + /* Perform 2 iterations of Newton-Raphson method for better >>>>>> accuracy */ >>>>>> + for (int i = 0; i < 2; i++) >>>>>> + { >>>>>> + emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2])); >>>>>> + emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp)); >>>>>> + } >>>>>> + >>>>>> + /* We now have reciprocal in rec, perform operands[0] = operands[1] >>>>>> * rec */ >>>>>> + emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec)); >>>>>> + DONE; >>>>>> + } >>>>>> +) >>>>>> + >>>>>> + >>>>>> (define_insn "mul<mode>3add<mode>_neon" >>>>>> [(set (match_operand:VDQW 0 "s_register_operand" "=w") >>>>>> (plus:VDQW (mult:VDQW (match_operand:VDQW 2 >>>>>> "s_register_operand" "w") >>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c >>>>>> b/gcc/testsuite/gcc.target/arm/vect-div-1.c >>>>>> new file mode 100644 >>>>>> index 0000000..e562ef3 >>>>>> --- /dev/null >>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c >>>>>> @@ -0,0 +1,14 @@ >>>>>> +/* { dg-do compile } */ >>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */ >>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize >>>>>> -fdump-tree-vect-all" } */ >>>>>> +/* { dg-add-options arm_v8_neon } */ >>>>> >>>>> No this is wrong. >>>>> >>>>> What is armv8 specific about this test ? This is just like another test >>>>> that is for Neon. vrecpe / vrecps are not instructions that were >>>>> introduced in the v8 version of the architecture. They've existed in the >>>>> base Neon instruction set. The code generation above in the patterns will >>>>> be enabled when TARGET_NEON is true which can happen when -mfpu=neon >>>>> -mfloat-abi={softfp/hard} is true. >>>>> >>>>>> + >>>>>> +void >>>>>> +foo (int len, float * __restrict p, float *__restrict x) >>>>>> +{ >>>>>> + len = len & ~31; >>>>>> + for (int i = 0; i < len; i++) >>>>>> + p[i] = p[i] / x[i]; >>>>>> +} >>>>>> + >>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } >>>>>> */ >>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c >>>>>> b/gcc/testsuite/gcc.target/arm/vect-div-2.c >>>>>> new file mode 100644 >>>>>> index 0000000..8e15d0a >>>>>> --- /dev/null >>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c >>>>>> @@ -0,0 +1,14 @@ >>>>>> +/* { dg-do compile } */ >>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */ >>>>> >>>>> And likewise. >>>>> >>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-math >>>>>> -ftree-vectorize -fdump-tree-vect-all" } */ >>>>>> +/* { dg-add-options arm_v8_neon } */ >>>>>> + >>>>>> +void >>>>>> +foo (int len, float * __restrict p, float *__restrict x) >>>>>> +{ >>>>>> + len = len & ~31; >>>>>> + for (int i = 0; i < len; i++) >>>>>> + p[i] = p[i] / x[i]; >>>>>> +} >>>>>> + >>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } >>>>>> */ >>>>> >>>>> >>>>> regards >>>>> Ramana