On 5 February 2016 at 18:40, Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> wrote: > On 4 February 2016 at 16:31, Ramana Radhakrishnan > <ramana....@googlemail.com> wrote: >> On Sun, Jan 17, 2016 at 9:06 AM, Prathamesh Kulkarni >> <prathamesh.kulka...@linaro.org> wrote: >>> On 31 July 2015 at 15:04, Ramana Radhakrishnan >>> <ramana.radhakrish...@foss.arm.com> wrote: >>>> >>>> >>>> On 29/07/15 11:09, Prathamesh Kulkarni wrote: >>>>> Hi, >>>>> This patch tries to implement division with multiplication by >>>>> reciprocal using vrecpe/vrecps >>>>> with -funsafe-math-optimizations and -freciprocal-math enabled. >>>>> Tested on arm-none-linux-gnueabihf using qemu. >>>>> OK for trunk ? >>>>> >>>>> Thank you, >>>>> Prathamesh >>>>> >>>> >>>> I've tried this in the past and never been convinced that 2 iterations are >>>> enough to get to stability with this given that the results are only >>>> precise for 8 bits / iteration. Thus I've always believed you need 3 >>>> iterations rather than 2 at which point I've never been sure that it's >>>> worth it. So the testing that you've done with this currently is not >>>> enough for this to go into the tree. >>>> >>>> I'd like this to be tested on a couple of different AArch32 >>>> implementations with a wider range of inputs to verify that the results >>>> are acceptable as well as running something like SPEC2k(6) with atleast >>>> one iteration to ensure correctness. >>> Hi, >>> I got results of SPEC2k6 fp benchmarks: >>> a15: +0.64% overall, 481.wrf: +6.46% >>> a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76% >>> a57: +0.35% overall, 481.wrf: +3.84% >>> The other benchmarks had (almost) identical results. >> >> Thanks for the benchmarking results - Please repost the patch with >> the changes that I had requested in my previous review - given it is >> now stage4 , I would rather queue changes like this for stage1 now. > Hi, > Please find the updated patch attached. > It passes testsuite for arm-none-linux-gnueabi, arm-none-linux-gnueabihf and > arm-none-eabi. > However the test-case added in the patch (neon-vect-div-1.c) fails to > get vectorized at -O2 > for armeb-none-linux-gnueabihf. > Charles suggested me to try with -O3, which worked. > It appears the test-case fails to get vectorized with > -fvect-cost-model=cheap (which is default enabled at -O2) > and passes for -fno-vect-cost-model / -fvect-cost-model=dynamic > > I can't figure out why it fails -fvect-cost-model=cheap. > From the vect dump (attached): > neon-vect-div-1.c:12:3: note: Setting misalignment to -1. > neon-vect-div-1.c:12:3: note: not vectorized: unsupported unaligned load.*_9 Hi, I think I have some idea why the test-case fails attached with patch fail to get vectorized on armeb with -O2.
Issue with big endian vectorizer: The patch does not cause regressions on big endian vectorizer but fails to vectorize the test-cases attached with the patch, while they get vectorized on litttle-endian. Fails with armeb with the following message in dump: note: not vectorized: unsupported unaligned load.*_9 The behavior of big and little endian vectorizer seems to be different in arm_builtin_support_vector_misalignment() which overrides the hook targetm.vectorize.support_vector_misalignment(). targetm.vectorize.support_vector_misalignment is called by vect_supportable_dr_alignment () which in turn is called by verify_data_refs_alignment (). Execution upto following condition is common between arm and armeb in vect_supportable_dr_alignment(): if ((TYPE_USER_ALIGN (type) && !is_packed) || targetm.vectorize.support_vector_misalignment (mode, type, DR_MISALIGNMENT (dr), is_packed)) /* Can't software pipeline the loads, but can at least do them. */ return dr_unaligned_supported; For little endian case: arm_builtin_support_vector_misalignment() is called with V2SF mode and misalignment == -1, and the following condition becomes true: /* If the misalignment is unknown, we should be able to handle the access so long as it is not to a member of a packed data structure. */ if (misalignment == -1) return true; Since the hook returned true we enter the condition above in vect_supportable_dr_alignment() and return dr_unaligned_supported; For big-endian: arm_builtin_support_vector_misalignment() is called with V2SF mode. The following condition that gates the entire function body fails: if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access) and the default hook gets called with V2SF mode and the default hook returns false because movmisalign_optab does not exist for V2SF mode. So the condition above in vect_supportable_dr_alignment() fails and we come here: /* Unsupported. */ return dr_unaligned_unsupported; And hence we get the unaligned load not supported message in the dump for armeb in verify_data_ref_alignment (): static bool verify_data_ref_alignment (data_reference_p dr) { enum dr_alignment_support supportable_dr_alignment = vect_supportable_dr_alignment (dr, false); if (!supportable_dr_alignment) { if (dump_enabled_p ()) { if (DR_IS_READ (dr)) dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, "not vectorized: unsupported unaligned load."); With -O3, the test-cases vectorize for armeb, because loop peeling for alignment is turned on. The above behavior is also reproducible with test-case which is irrelevant to the patch. for instance, we get the same unsupported unaligned load for following test-case (replaced / with +) void foo (int len, float * __restrict p, float *__restrict x) { len = len & ~31; for (int i = 0; i < len; i++) p[i] = p[i] + x[i]; } Is the patch OK to commit after bootstrap+test ? Thanks, Prathamesh > > Thanks, > Prathamesh >> >> Thanks, >> Ramana >> >>> >>> Thanks, >>> Prathamesh >>>> >>>> >>>> moving on to the patches. >>>> >>>>> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md >>>>> index 654d9d5..28c2e2a 100644 >>>>> --- a/gcc/config/arm/neon.md >>>>> +++ b/gcc/config/arm/neon.md >>>>> @@ -548,6 +548,32 @@ >>>>> (const_string "neon_mul_<V_elem_ch><q>")))] >>>>> ) >>>>> >>>> >>>> Please add a comment here. >>>> >>>>> +(define_expand "div<mode>3" >>>>> + [(set (match_operand:VCVTF 0 "s_register_operand" "=w") >>>>> + (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w") >>>>> + (match_operand:VCVTF 2 "s_register_operand" "w")))] >>>> >>>> I want to double check that this doesn't collide with Alan's patches for >>>> FP16 especially if he reuses the VCVTF iterator for all the vcvt f16 cases. >>>> >>>>> + "TARGET_NEON && flag_unsafe_math_optimizations && flag_reciprocal_math" >>>>> + { >>>>> + rtx rec = gen_reg_rtx (<MODE>mode); >>>>> + rtx vrecps_temp = gen_reg_rtx (<MODE>mode); >>>>> + >>>>> + /* Reciprocal estimate */ >>>>> + emit_insn (gen_neon_vrecpe<mode> (rec, operands[2])); >>>>> + >>>>> + /* Perform 2 iterations of Newton-Raphson method for better accuracy >>>>> */ >>>>> + for (int i = 0; i < 2; i++) >>>>> + { >>>>> + emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2])); >>>>> + emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp)); >>>>> + } >>>>> + >>>>> + /* We now have reciprocal in rec, perform operands[0] = operands[1] >>>>> * rec */ >>>>> + emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec)); >>>>> + DONE; >>>>> + } >>>>> +) >>>>> + >>>>> + >>>>> (define_insn "mul<mode>3add<mode>_neon" >>>>> [(set (match_operand:VDQW 0 "s_register_operand" "=w") >>>>> (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" >>>>> "w") >>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c >>>>> b/gcc/testsuite/gcc.target/arm/vect-div-1.c >>>>> new file mode 100644 >>>>> index 0000000..e562ef3 >>>>> --- /dev/null >>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c >>>>> @@ -0,0 +1,14 @@ >>>>> +/* { dg-do compile } */ >>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */ >>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize >>>>> -fdump-tree-vect-all" } */ >>>>> +/* { dg-add-options arm_v8_neon } */ >>>> >>>> No this is wrong. >>>> >>>> What is armv8 specific about this test ? This is just like another test >>>> that is for Neon. vrecpe / vrecps are not instructions that were >>>> introduced in the v8 version of the architecture. They've existed in the >>>> base Neon instruction set. The code generation above in the patterns will >>>> be enabled when TARGET_NEON is true which can happen when -mfpu=neon >>>> -mfloat-abi={softfp/hard} is true. >>>> >>>>> + >>>>> +void >>>>> +foo (int len, float * __restrict p, float *__restrict x) >>>>> +{ >>>>> + len = len & ~31; >>>>> + for (int i = 0; i < len; i++) >>>>> + p[i] = p[i] / x[i]; >>>>> +} >>>>> + >>>>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ >>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c >>>>> b/gcc/testsuite/gcc.target/arm/vect-div-2.c >>>>> new file mode 100644 >>>>> index 0000000..8e15d0a >>>>> --- /dev/null >>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c >>>>> @@ -0,0 +1,14 @@ >>>>> +/* { dg-do compile } */ >>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */ >>>> >>>> And likewise. >>>> >>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-math >>>>> -ftree-vectorize -fdump-tree-vect-all" } */ >>>>> +/* { dg-add-options arm_v8_neon } */ >>>>> + >>>>> +void >>>>> +foo (int len, float * __restrict p, float *__restrict x) >>>>> +{ >>>>> + len = len & ~31; >>>>> + for (int i = 0; i < len; i++) >>>>> + p[i] = p[i] / x[i]; >>>>> +} >>>>> + >>>>> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */ >>>> >>>> >>>> regards >>>> Ramana
diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md index 6b4896d..862e31b 100644 --- a/gcc/config/arm/neon.md +++ b/gcc/config/arm/neon.md @@ -578,6 +578,38 @@ (const_string "neon_mul_<V_elem_ch><q>")))] ) +/* Perform division using multiply-by-reciprocal. + Reciprocal is calculated using Newton-Raphson method. + Enabled with -funsafe-math-optimizations -freciprocal-math + and disabled for -Os since it increases code size . */ + +(define_expand "div<mode>3" + [(set (match_operand:VCVTF 0 "s_register_operand" "=w") + (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w") + (match_operand:VCVTF 2 "s_register_operand" "w")))] + "TARGET_NEON && !optimize_size + && flag_unsafe_math_optimizations && flag_reciprocal_math" + { + rtx rec = gen_reg_rtx (<MODE>mode); + rtx vrecps_temp = gen_reg_rtx (<MODE>mode); + + /* Reciprocal estimate. */ + emit_insn (gen_neon_vrecpe<mode> (rec, operands[2])); + + /* Perform 2 iterations of newton-raphson method. */ + for (int i = 0; i < 2; i++) + { + emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2])); + emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp)); + } + + /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec. */ + emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec)); + DONE; + } +) + + (define_insn "mul<mode>3add<mode>_neon" [(set (match_operand:VDQW 0 "s_register_operand" "=w") (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w") diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c new file mode 100644 index 0000000..dae724a --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c @@ -0,0 +1,16 @@ +/* Test pattern div<mode>3. */ + +/* { dg-do compile } */ +/* { dg-require-effective-target arm_neon_ok } */ +/* { dg-options "-O3 -funsafe-math-optimizations -fdump-tree-vect-all" } */ +/* { dg-add-options arm_neon } */ + +void +foo (int len, float * __restrict p, float *__restrict x) +{ + len = len & ~31; + for (int i = 0; i < len; i++) + p[i] = p[i] / x[i]; +} + +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c new file mode 100644 index 0000000..0450b70 --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c @@ -0,0 +1,16 @@ +/* Test pattern div<mode>3. */ + +/* { dg-do compile } */ +/* { dg-require-effective-target arm_neon_ok } */ +/* { dg-options "-O3 -funsafe-math-optimizations -fno-reciprocal-math -fdump-tree-vect-all" } */ +/* { dg-add-options arm_neon } */ + +void +foo (int len, float * __restrict p, float *__restrict x) +{ + len = len & ~31; + for (int i = 0; i < len; i++) + p[i] = p[i] / x[i]; +} + +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */
ChangeLog
Description: Binary data