>> Please find the updated patch attached.
>> It passes the testsuite for arm-none-linux-gnueabi, arm-none-linux-gnueabihf
>> and arm-none-eabi.
>> However, the test-case added in the patch (neon-vect-div-1.c) fails to
>> get vectorized at -O2 for armeb-none-linux-gnueabihf.
>> Charles suggested I try with -O3, which worked.
>> It appears the test-case fails to get vectorized with
>> -fvect-cost-model=cheap (which is enabled by default at -O2)
>> and passes with -fno-vect-cost-model / -fvect-cost-model=dynamic.
>>
>> I can't figure out why it fails with -fvect-cost-model=cheap.
>> From the vect dump (attached):
>> neon-vect-div-1.c:12:3: note: Setting misalignment to -1.
>> neon-vect-div-1.c:12:3: note: not vectorized: unsupported unaligned load.*_9
> Hi,
> I think I have some idea why the test-cases attached with the patch
> fail to get vectorized on armeb with -O2.
>
> Issue with the big-endian vectorizer:
> The patch does not cause regressions with the big-endian vectorizer, but
> the test-cases attached with the patch fail to vectorize there, while
> they get vectorized on little-endian.
> On armeb they fail with the following message in the dump:
> note: not vectorized: unsupported unaligned load.*_9
>
> The behavior of the big- and little-endian vectorizers differs in
> arm_builtin_support_vector_misalignment(), which overrides the hook
> targetm.vectorize.support_vector_misalignment().
>
> targetm.vectorize.support_vector_misalignment is called by
> vect_supportable_dr_alignment (), which in turn is called by
> verify_data_ref_alignment ().
>
> Execution up to the following condition is common between arm and armeb
> in vect_supportable_dr_alignment ():
>
>   if ((TYPE_USER_ALIGN (type) && !is_packed)
>       || targetm.vectorize.support_vector_misalignment (mode, type,
>                                                         DR_MISALIGNMENT (dr),
>                                                         is_packed))
>     /* Can't software pipeline the loads, but can at least do them.  */
>     return dr_unaligned_supported;
>
> For the little-endian case:
> arm_builtin_support_vector_misalignment() is called with V2SF mode and
> misalignment == -1, and the following condition becomes true:
>
>   /* If the misalignment is unknown, we should be able to handle the access
>      so long as it is not to a member of a packed data structure.  */
>   if (misalignment == -1)
>     return true;
>
> Since the hook returned true, we enter the condition above in
> vect_supportable_dr_alignment() and return dr_unaligned_supported.
>
> For big-endian:
> arm_builtin_support_vector_misalignment() is called with V2SF mode.
> The following condition that gates the entire function body fails:
>
>   if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access)
>
> and the default hook gets called with V2SF mode; the default hook
> returns false because movmisalign_optab does not exist for V2SF mode.
>
> So the condition above in vect_supportable_dr_alignment() fails and we
> come here:
>
>   /* Unsupported.  */
>   return dr_unaligned_unsupported;
>
> And hence we get the "unaligned load not supported" message in the dump
> for armeb, from verify_data_ref_alignment ():
>
> static bool
> verify_data_ref_alignment (data_reference_p dr)
> {
>   enum dr_alignment_support supportable_dr_alignment
>     = vect_supportable_dr_alignment (dr, false);
>   if (!supportable_dr_alignment)
>     {
>       if (dump_enabled_p ())
>         {
>           if (DR_IS_READ (dr))
>             dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                              "not vectorized: unsupported unaligned load.");
>
> With -O3, the test-cases vectorize for armeb, because loop peeling for
> alignment is turned on.
> The above behavior is also reproducible with a test-case that is
> unrelated to the patch.
> For instance, we get the same "unsupported unaligned load" for the
> following test-case (with / replaced by +):
>
> void
> foo (int len, float * __restrict p, float *__restrict x)
> {
>   len = len & ~31;
>   for (int i = 0; i < len; i++)
>     p[i] = p[i] + x[i];
> }
>
> Is the patch OK to commit after bootstrap+test?
Thanks for the analysis - all the test needs is an additional marker to
skip it on armeb (is there a helper for misaligned loads from the
vectorizer?) - Ah, vect_hw_misalign is probably sufficient for your use
case, and you want to fix it appropriately for little-endian arm with
Neon support enabled.

From the patch:

>>+ && flag_unsafe_math_optimizations && flag_reciprocal_math"

Why do we need flag_unsafe_math_optimizations && flag_reciprocal_math?
flag_unsafe_math_optimizations should be sufficient, since it enables
flag_reciprocal_math - the reason for flag_unsafe_math_optimizations is
to prevent loss of precision, and the fact that on Neon denormalized
numbers are flushed to zero.

OK with that change and a quick test with vect_hw_misalign added to your
testcase. Sorry about the delay in reviewing.

Ramana

> Thanks,
> Prathamesh
>>
>> Thanks,
>> Prathamesh
>>>
>>> Thanks,
>>> Ramana
>>>
>>>>
>>>> Thanks,
>>>> Prathamesh
>>>>>
>>>>>
>>>>> moving on to the patches.
>>>>>
>>>>>> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
>>>>>> index 654d9d5..28c2e2a 100644
>>>>>> --- a/gcc/config/arm/neon.md
>>>>>> +++ b/gcc/config/arm/neon.md
>>>>>> @@ -548,6 +548,32 @@
>>>>>>        (const_string "neon_mul_<V_elem_ch><q>")))]
>>>>>> )
>>>>>>
>>>>>
>>>>> Please add a comment here.
>>>>>
>>>>>> +(define_expand "div<mode>3"
>>>>>> +  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
>>>>>> +        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
>>>>>> +                   (match_operand:VCVTF 2 "s_register_operand" "w")))]
>>>>>
>>>>> I want to double check that this doesn't collide with Alan's patches
>>>>> for FP16, especially if he reuses the VCVTF iterator for all the vcvt
>>>>> f16 cases.
>>>>>
>>>>>> +  "TARGET_NEON && flag_unsafe_math_optimizations && flag_reciprocal_math"
>>>>>> +  {
>>>>>> +    rtx rec = gen_reg_rtx (<MODE>mode);
>>>>>> +    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
>>>>>> +
>>>>>> +    /* Reciprocal estimate */
>>>>>> +    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
>>>>>> +
>>>>>> +    /* Perform 2 iterations of Newton-Raphson method for better accuracy */
>>>>>> +    for (int i = 0; i < 2; i++)
>>>>>> +      {
>>>>>> +        emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
>>>>>> +        emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
>>>>>> +      }
>>>>>> +
>>>>>> +    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec */
>>>>>> +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
>>>>>> +    DONE;
>>>>>> +  }
>>>>>> +)
>>>>>> +
>>>>>> +
>>>>>> (define_insn "mul<mode>3add<mode>_neon"
>>>>>>   [(set (match_operand:VDQW 0 "s_register_operand" "=w")
>>>>>>         (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
>>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c b/gcc/testsuite/gcc.target/arm/vect-div-1.c
>>>>>> new file mode 100644
>>>>>> index 0000000..e562ef3
>>>>>> --- /dev/null
>>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c
>>>>>> @@ -0,0 +1,14 @@
>>>>>> +/* { dg-do compile } */
>>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize -fdump-tree-vect-all" } */
>>>>>> +/* { dg-add-options arm_v8_neon } */
>>>>>
>>>>> No, this is wrong.
>>>>>
>>>>> What is armv8-specific about this test? This is just like any other
>>>>> test for Neon. vrecpe / vrecps are not instructions that were
>>>>> introduced in the v8 version of the architecture; they have existed
>>>>> in the base Neon instruction set. The code generation above in the
>>>>> patterns will be enabled when TARGET_NEON is true, which can happen
>>>>> with -mfpu=neon -mfloat-abi={softfp/hard}.
>>>>>
>>>>>> +
>>>>>> +void
>>>>>> +foo (int len, float * __restrict p, float *__restrict x)
>>>>>> +{
>>>>>> +  len = len & ~31;
>>>>>> +  for (int i = 0; i < len; i++)
>>>>>> +    p[i] = p[i] / x[i];
>>>>>> +}
>>>>>> +
>>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
>>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c b/gcc/testsuite/gcc.target/arm/vect-div-2.c
>>>>>> new file mode 100644
>>>>>> index 0000000..8e15d0a
>>>>>> --- /dev/null
>>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c
>>>>>> @@ -0,0 +1,14 @@
>>>>>> +/* { dg-do compile } */
>>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>>>>>
>>>>> And likewise.
>>>>>
>>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-math -ftree-vectorize -fdump-tree-vect-all" } */
>>>>>> +/* { dg-add-options arm_v8_neon } */
>>>>>> +
>>>>>> +void
>>>>>> +foo (int len, float * __restrict p, float *__restrict x)
>>>>>> +{
>>>>>> +  len = len & ~31;
>>>>>> +  for (int i = 0; i < len; i++)
>>>>>> +    p[i] = p[i] / x[i];
>>>>>> +}
>>>>>> +
>>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */
>>>>>
>>>>>
>>>>> regards
>>>>> Ramana
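[Per Ramana's review, the header of the testcase could be gated roughly as below - a sketch only: the exact directives follow the usual gcc.target/arm conventions (`arm_neon_ok` / `arm_neon` instead of the v8 variants, plus the `vect_hw_misalign` effective-target keyword to skip targets without misaligned vector loads, i.e. armeb here); the precise selector choice would need checking against the testsuite:]

```c
/* { dg-do compile } */
/* { dg-require-effective-target arm_neon_ok } */
/* { dg-require-effective-target vect_hw_misalign } */
/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize -fdump-tree-vect-all" } */
/* { dg-add-options arm_neon } */
```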