Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations

Prathamesh Kulkarni Mon, 23 May 2016 02:31:01 -0700

On 5 February 2016 at 18:40, Prathamesh Kulkarni
<prathamesh.kulka...@linaro.org> wrote:
> On 4 February 2016 at 16:31, Ramana Radhakrishnan
> <ramana....@googlemail.com> wrote:
>> On Sun, Jan 17, 2016 at 9:06 AM, Prathamesh Kulkarni
>> <prathamesh.kulka...@linaro.org> wrote:
>>> On 31 July 2015 at 15:04, Ramana Radhakrishnan
>>> <ramana.radhakrish...@foss.arm.com> wrote:
>>>>
>>>>
>>>> On 29/07/15 11:09, Prathamesh Kulkarni wrote:
>>>>> Hi,
>>>>> This patch tries to implement division with multiplication by
>>>>> reciprocal using vrecpe/vrecps
>>>>> with -funsafe-math-optimizations and -freciprocal-math enabled.
>>>>> Tested on arm-none-linux-gnueabihf using qemu.
>>>>> OK for trunk ?
>>>>>
>>>>> Thank you,
>>>>> Prathamesh
>>>>>
>>>>
>>>> I've tried this in the past and never been convinced that 2 iterations are 
>>>> enough to get to stability with this given that the results are only 
>>>> precise for 8 bits / iteration. Thus I've always believed you need 3 
>>>> iterations rather than 2 at which point I've never been sure that it's 
>>>> worth it. So the testing that you've done with this currently is not 
>>>> enough for this to go into the tree.
>>>>
>>>> I'd like this to be tested on a couple of different AArch32 
>>>> implementations with a wider range of inputs to verify that the results 
>>>> are acceptable as well as running something like SPEC2k(6) with atleast 
>>>> one iteration to ensure correctness.
>>> Hi,
>>> I got results of SPEC2k6 fp benchmarks:
>>> a15: +0.64% overall, 481.wrf: +6.46%
>>> a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76%
>>> a57: +0.35% overall, 481.wrf: +3.84%
>>> The other benchmarks had (almost) identical results.
>>
>> Thanks for the benchmarking results -  Please repost the patch with
>> the changes that I had requested in my previous review - given it is
>> now stage4 , I would rather queue changes like this for stage1 now.
> Hi,
> Please find the updated patch attached.
> It passes testsuite for arm-none-linux-gnueabi, arm-none-linux-gnueabihf and
> arm-none-eabi.
> However the test-case added in the patch (neon-vect-div-1.c) fails to
> get vectorized at -O2
> for armeb-none-linux-gnueabihf.
> Charles suggested me to try with -O3, which worked.
> It appears the test-case fails to get vectorized with
> -fvect-cost-model=cheap (which is default enabled at -O2)
> and passes for -fno-vect-cost-model / -fvect-cost-model=dynamic
>
> I can't figure out why it fails -fvect-cost-model=cheap.
> From the vect dump (attached):
> neon-vect-div-1.c:12:3: note: Setting misalignment to -1.
> neon-vect-div-1.c:12:3: note: not vectorized: unsupported unaligned load.*_9
Hi,
I think I have some idea why the test-case fails attached with patch
fail to get vectorized on armeb with -O2.


Issue with big endian vectorizer:
The patch does not cause regressions on big endian vectorizer but
fails to vectorize the test-cases attached with the patch, while they
get vectorized on
litttle-endian.
Fails with armeb with the following message in dump:
note: not vectorized: unsupported unaligned load.*_9

The behavior of big and little endian vectorizer seems to be different
in arm_builtin_support_vector_misalignment() which overrides the hook
targetm.vectorize.support_vector_misalignment().

targetm.vectorize.support_vector_misalignment is called by
vect_supportable_dr_alignment () which in turn is called
by verify_data_refs_alignment ().

Execution upto following condition is common between arm and armeb
in vect_supportable_dr_alignment():

if ((TYPE_USER_ALIGN (type) && !is_packed)
      || targetm.vectorize.support_vector_misalignment (mode, type,
                                            DR_MISALIGNMENT (dr), is_packed))
        /* Can't software pipeline the loads, but can at least do them.  */
        return dr_unaligned_supported;

For little endian case:
arm_builtin_support_vector_misalignment() is called with
V2SF mode and misalignment == -1, and the following condition
becomes true:
/* If the misalignment is unknown, we should be able to handle the access
         so long as it is not to a member of a packed data structure.  */
  if (misalignment == -1)
    return true;

Since the hook returned true we enter the condition above in
vect_supportable_dr_alignment() and return dr_unaligned_supported;

For big-endian:
arm_builtin_support_vector_misalignment() is called with V2SF mode.
The following condition that gates the entire function body fails:
 if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access)
and the default hook gets called with V2SF mode and the default hook
returns false because
movmisalign_optab does not exist for V2SF mode.

So the condition above in vect_supportable_dr_alignment() fails
and we come here:
 /* Unsupported.  */
return dr_unaligned_unsupported;

And hence we get the unaligned load not supported message in the dump
for armeb in verify_data_ref_alignment ():

static bool
verify_data_ref_alignment (data_reference_p dr)
{
  enum dr_alignment_support supportable_dr_alignment
    = vect_supportable_dr_alignment (dr, false);
  if (!supportable_dr_alignment)
    {
      if (dump_enabled_p ())
        {
          if (DR_IS_READ (dr))
            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                             "not vectorized: unsupported unaligned load.");

With -O3, the test-cases vectorize for armeb, because loop peeling for alignment
is turned on.
The above behavior is also reproducible with test-case which is
irrelevant to the patch.
for instance, we get the same unsupported unaligned load for following
test-case (replaced / with +)

void
foo (int len, float * __restrict p, float *__restrict x)
{
  len = len & ~31;
  for (int i = 0; i < len; i++)
    p[i] = p[i] + x[i];
}
Is the patch OK to commit after bootstrap+test ?

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
>>
>> Thanks,
>> Ramana
>>
>>>
>>> Thanks,
>>> Prathamesh
>>>>
>>>>
>>>> moving on to the patches.
>>>>
>>>>> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
>>>>> index 654d9d5..28c2e2a 100644
>>>>> --- a/gcc/config/arm/neon.md
>>>>> +++ b/gcc/config/arm/neon.md
>>>>> @@ -548,6 +548,32 @@
>>>>>                      (const_string "neon_mul_<V_elem_ch><q>")))]
>>>>>  )
>>>>>
>>>>
>>>> Please add a comment here.
>>>>
>>>>> +(define_expand "div<mode>3"
>>>>> +  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
>>>>> +        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
>>>>> +               (match_operand:VCVTF 2 "s_register_operand" "w")))]
>>>>
>>>> I want to double check that this doesn't collide with Alan's patches for 
>>>> FP16 especially if he reuses the VCVTF iterator for all the vcvt f16 cases.
>>>>
>>>>> +  "TARGET_NEON && flag_unsafe_math_optimizations && flag_reciprocal_math"
>>>>> +  {
>>>>> +    rtx rec = gen_reg_rtx (<MODE>mode);
>>>>> +    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
>>>>> +
>>>>> +    /* Reciprocal estimate */
>>>>> +    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
>>>>> +
>>>>> +    /* Perform 2 iterations of Newton-Raphson method for better accuracy 
>>>>> */
>>>>> +    for (int i = 0; i < 2; i++)
>>>>> +      {
>>>>> +     emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
>>>>> +     emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
>>>>> +      }
>>>>> +
>>>>> +    /* We now have reciprocal in rec, perform operands[0] = operands[1] 
>>>>> * rec */
>>>>> +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
>>>>> +    DONE;
>>>>> +  }
>>>>> +)
>>>>> +
>>>>> +
>>>>>  (define_insn "mul<mode>3add<mode>_neon"
>>>>>    [(set (match_operand:VDQW 0 "s_register_operand" "=w")
>>>>>          (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" 
>>>>> "w")
>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c 
>>>>> b/gcc/testsuite/gcc.target/arm/vect-div-1.c
>>>>> new file mode 100644
>>>>> index 0000000..e562ef3
>>>>> --- /dev/null
>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c
>>>>> @@ -0,0 +1,14 @@
>>>>> +/* { dg-do compile } */
>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize 
>>>>> -fdump-tree-vect-all" } */
>>>>> +/* { dg-add-options arm_v8_neon } */
>>>>
>>>> No this is wrong.
>>>>
>>>> What is armv8 specific about this test ? This is just like another test 
>>>> that is for Neon. vrecpe / vrecps are not instructions that were 
>>>> introduced in the v8 version of the architecture. They've existed in the 
>>>> base Neon instruction set. The code generation above in the patterns will 
>>>> be enabled when TARGET_NEON is true which can happen when -mfpu=neon 
>>>> -mfloat-abi={softfp/hard} is true.
>>>>
>>>>> +
>>>>> +void
>>>>> +foo (int len, float * __restrict p, float *__restrict x)
>>>>> +{
>>>>> +  len = len & ~31;
>>>>> +  for (int i = 0; i < len; i++)
>>>>> +    p[i] = p[i] / x[i];
>>>>> +}
>>>>> +
>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c 
>>>>> b/gcc/testsuite/gcc.target/arm/vect-div-2.c
>>>>> new file mode 100644
>>>>> index 0000000..8e15d0a
>>>>> --- /dev/null
>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c
>>>>> @@ -0,0 +1,14 @@
>>>>> +/* { dg-do compile } */
>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>>>>
>>>> And likewise.
>>>>
>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-math 
>>>>> -ftree-vectorize -fdump-tree-vect-all" } */
>>>>> +/* { dg-add-options arm_v8_neon } */
>>>>> +
>>>>> +void
>>>>> +foo (int len, float * __restrict p, float *__restrict x)
>>>>> +{
>>>>> +  len = len & ~31;
>>>>> +  for (int i = 0; i < len; i++)
>>>>> +    p[i] = p[i] / x[i];
>>>>> +}
>>>>> +
>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */
>>>>
>>>>
>>>> regards
>>>> Ramana

diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 6b4896d..862e31b 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -578,6 +578,38 @@
                     (const_string "neon_mul_<V_elem_ch><q>")))]
 )
 
+/* Perform division using multiply-by-reciprocal. 
+   Reciprocal is calculated using Newton-Raphson method.
+   Enabled with -funsafe-math-optimizations -freciprocal-math
+   and disabled for -Os since it increases code size .  */
+
+(define_expand "div<mode>3"
+  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
+                 (match_operand:VCVTF 2 "s_register_operand" "w")))]
+  "TARGET_NEON && !optimize_size
+   && flag_unsafe_math_optimizations && flag_reciprocal_math"
+  {
+    rtx rec = gen_reg_rtx (<MODE>mode);
+    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
+
+    /* Reciprocal estimate.  */
+    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
+
+    /* Perform 2 iterations of newton-raphson method.  */
+    for (int i = 0; i < 2; i++)
+      {
+       emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
+       emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
+      }
+
+    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec. 
 */
+    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
+    DONE;
+  }
+)
+
+
 (define_insn "mul<mode>3add<mode>_neon"
   [(set (match_operand:VDQW 0 "s_register_operand" "=w")
         (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c 
b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
new file mode 100644
index 0000000..dae724a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
@@ -0,0 +1,16 @@
+/* Test pattern div<mode>3.  */
+
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-options "-O3 -funsafe-math-optimizations -fdump-tree-vect-all" } */
+/* { dg-add-options arm_neon } */
+
+void
+foo (int len, float * __restrict p, float *__restrict x)
+{
+  len = len & ~31;
+  for (int i = 0; i < len; i++)
+    p[i] = p[i] / x[i];
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c 
b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
new file mode 100644
index 0000000..0450b70
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
@@ -0,0 +1,16 @@
+/* Test pattern div<mode>3.  */
+
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-options "-O3 -funsafe-math-optimizations -fno-reciprocal-math 
-fdump-tree-vect-all" } */
+/* { dg-add-options arm_neon } */
+
+void
+foo (int len, float * __restrict p, float *__restrict x)
+{
+  len = len & ~31;
+  for (int i = 0; i < len; i++)
+    p[i] = p[i] / x[i];
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */

ChangeLog
Description: Binary data

Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations

Reply via email to