>> Please find the updated patch attached.
>> It passes the testsuite for arm-none-linux-gnueabi, arm-none-linux-gnueabihf
>> and arm-none-eabi.
>> However, the test-case added in the patch (neon-vect-div-1.c) fails to
>> get vectorized at -O2 for armeb-none-linux-gnueabihf.
>> Charles suggested I try with -O3, which worked.
>> It appears the test-case fails to get vectorized with
>> -fvect-cost-model=cheap (which is enabled by default at -O2)
>> and passes with -fno-vect-cost-model / -fvect-cost-model=dynamic.
>>
>> I can't figure out why it fails with -fvect-cost-model=cheap.
>> From the vect dump (attached):
>> neon-vect-div-1.c:12:3: note: Setting misalignment to -1.
>> neon-vect-div-1.c:12:3: note: not vectorized: unsupported unaligned load.*_9
> Hi,
> I think I have some idea why the test-cases attached with the patch
> fail to get vectorized on armeb with -O2.
>
> Issue with the big-endian vectorizer:
> The patch does not cause regressions with the big-endian vectorizer, but
> the test-cases attached with the patch fail to vectorize there, while
> they get vectorized on little-endian.
> On armeb they fail with the following message in the dump:
> note: not vectorized: unsupported unaligned load.*_9
>
> The behavior of the big- and little-endian vectorizers differs in
> arm_builtin_support_vector_misalignment(), which overrides the hook
> targetm.vectorize.support_vector_misalignment().
>
> targetm.vectorize.support_vector_misalignment is called by
> vect_supportable_dr_alignment (), which in turn is called by
> verify_data_ref_alignment ().
>
> Execution up to the following condition is common between arm and armeb
> in vect_supportable_dr_alignment ():
>
>   if ((TYPE_USER_ALIGN (type) && !is_packed)
>       || targetm.vectorize.support_vector_misalignment (mode, type,
>                                                         DR_MISALIGNMENT (dr),
>                                                         is_packed))
>     /* Can't software pipeline the loads, but can at least do them.  */
>     return dr_unaligned_supported;
>
> For the little-endian case:
> arm_builtin_support_vector_misalignment() is called with V2SF mode and
> misalignment == -1, and the following condition becomes true:
>
>   /* If the misalignment is unknown, we should be able to handle the access
>      so long as it is not to a member of a packed data structure.  */
>   if (misalignment == -1)
>     return true;
>
> Since the hook returned true, we enter the condition above in
> vect_supportable_dr_alignment() and return dr_unaligned_supported.
>
> For big-endian:
> arm_builtin_support_vector_misalignment() is called with V2SF mode.
> The following condition that gates the entire function body fails:
>
>   if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access)
>
> and the default hook gets called with V2SF mode; the default hook
> returns false because movmisalign_optab does not exist for V2SF mode.
>
> So the condition above in vect_supportable_dr_alignment() fails and we
> come here:
>
>   /* Unsupported.  */
>   return dr_unaligned_unsupported;
>
> And hence we get the "unaligned load not supported" message in the dump
> for armeb, from verify_data_ref_alignment ():
>
> static bool
> verify_data_ref_alignment (data_reference_p dr)
> {
>   enum dr_alignment_support supportable_dr_alignment
>     = vect_supportable_dr_alignment (dr, false);
>   if (!supportable_dr_alignment)
>     {
>       if (dump_enabled_p ())
>         {
>           if (DR_IS_READ (dr))
>             dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                              "not vectorized: unsupported unaligned load.");
>
> With -O3, the test-cases vectorize for armeb, because loop peeling for
> alignment is turned on.
> The above behavior is also reproducible with a test-case that is
> unrelated to the patch.
> For instance, we get the same "unsupported unaligned load" for the
> following test-case (with / replaced by +):
>
> void
> foo (int len, float * __restrict p, float *__restrict x)
> {
>   len = len & ~31;
>   for (int i = 0; i < len; i++)
>     p[i] = p[i] + x[i];
> }
>
> Is the patch OK to commit after bootstrap+test?
Thanks for the analysis - all the test needs is an additional marker to
skip it on armeb (is there a helper for misaligned loads from the
vectorizer?) - Ah, vect_hw_misalign is probably sufficient for your use
case, and you want to fix it appropriately for little-endian arm with
Neon support enabled.

From the patch:

>>+ && flag_unsafe_math_optimizations && flag_reciprocal_math"

Why do we need flag_unsafe_math_optimizations && flag_reciprocal_math?
flag_unsafe_math_optimizations should be sufficient, since it enables
flag_reciprocal_math - the reason for flag_unsafe_math_optimizations is
to prevent loss of precision, and the fact that on Neon denormalized
numbers are flushed to zero.

OK with that change and a quick test with vect_hw_misalign added to your
testcase. Sorry about the delay in reviewing.

Ramana

> Thanks,
> Prathamesh
>>
>> Thanks,
>> Prathamesh
>>>
>>> Thanks,
>>> Ramana
>>>
>>>>
>>>> Thanks,
>>>> Prathamesh
>>>>>
>>>>>
>>>>> moving on to the patches.
>>>>>
>>>>>> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
>>>>>> index 654d9d5..28c2e2a 100644
>>>>>> --- a/gcc/config/arm/neon.md
>>>>>> +++ b/gcc/config/arm/neon.md
>>>>>> @@ -548,6 +548,32 @@
>>>>>>        (const_string "neon_mul_<V_elem_ch><q>")))]
>>>>>> )
>>>>>>
>>>>>
>>>>> Please add a comment here.
>>>>>
>>>>>> +(define_expand "div<mode>3"
>>>>>> +  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
>>>>>> +        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
>>>>>> +                   (match_operand:VCVTF 2 "s_register_operand" "w")))]
>>>>>
>>>>> I want to double check that this doesn't collide with Alan's patches
>>>>> for FP16, especially if he reuses the VCVTF iterator for all the vcvt
>>>>> f16 cases.
>>>>>
>>>>>> +  "TARGET_NEON && flag_unsafe_math_optimizations && flag_reciprocal_math"
>>>>>> +  {
>>>>>> +    rtx rec = gen_reg_rtx (<MODE>mode);
>>>>>> +    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
>>>>>> +
>>>>>> +    /* Reciprocal estimate */
>>>>>> +    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
>>>>>> +
>>>>>> +    /* Perform 2 iterations of Newton-Raphson method for better accuracy */
>>>>>> +    for (int i = 0; i < 2; i++)
>>>>>> +      {
>>>>>> +        emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
>>>>>> +        emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
>>>>>> +      }
>>>>>> +
>>>>>> +    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec */
>>>>>> +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
>>>>>> +    DONE;
>>>>>> +  }
>>>>>> +)
>>>>>> +
>>>>>> +
>>>>>> (define_insn "mul<mode>3add<mode>_neon"
>>>>>>   [(set (match_operand:VDQW 0 "s_register_operand" "=w")
>>>>>>         (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
>>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c b/gcc/testsuite/gcc.target/arm/vect-div-1.c
>>>>>> new file mode 100644
>>>>>> index 0000000..e562ef3
>>>>>> --- /dev/null
>>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c
>>>>>> @@ -0,0 +1,14 @@
>>>>>> +/* { dg-do compile } */
>>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize -fdump-tree-vect-all" } */
>>>>>> +/* { dg-add-options arm_v8_neon } */
>>>>>
>>>>> No, this is wrong.
>>>>>
>>>>> What is armv8-specific about this test? This is just like any other
>>>>> test for Neon. vrecpe / vrecps are not instructions that were
>>>>> introduced in the v8 version of the architecture; they have existed
>>>>> in the base Neon instruction set. The code generation above in the
>>>>> patterns will be enabled when TARGET_NEON is true, which can happen
>>>>> with -mfpu=neon -mfloat-abi={softfp/hard}.
>>>>>
>>>>>> +
>>>>>> +void
>>>>>> +foo (int len, float * __restrict p, float *__restrict x)
>>>>>> +{
>>>>>> +  len = len & ~31;
>>>>>> +  for (int i = 0; i < len; i++)
>>>>>> +    p[i] = p[i] / x[i];
>>>>>> +}
>>>>>> +
>>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
>>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c b/gcc/testsuite/gcc.target/arm/vect-div-2.c
>>>>>> new file mode 100644
>>>>>> index 0000000..8e15d0a
>>>>>> --- /dev/null
>>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c
>>>>>> @@ -0,0 +1,14 @@
>>>>>> +/* { dg-do compile } */
>>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>>>>>
>>>>> And likewise.
>>>>>
>>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-math -ftree-vectorize -fdump-tree-vect-all" } */
>>>>>> +/* { dg-add-options arm_v8_neon } */
>>>>>> +
>>>>>> +void
>>>>>> +foo (int len, float * __restrict p, float *__restrict x)
>>>>>> +{
>>>>>> +  len = len & ~31;
>>>>>> +  for (int i = 0; i < len; i++)
>>>>>> +    p[i] = p[i] / x[i];
>>>>>> +}
>>>>>> +
>>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */
>>>>>
>>>>>
>>>>> regards
>>>>> Ramana
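[Per Ramana's review, the header of the testcase could be gated roughly as below - a sketch only: the exact directives follow the usual gcc.target/arm conventions (`arm_neon_ok` / `arm_neon` instead of the v8 variants, plus the `vect_hw_misalign` effective-target keyword to skip targets without misaligned vector loads, i.e. armeb here); the precise selector choice would need checking against the testsuite:]

```c
/* { dg-do compile } */
/* { dg-require-effective-target arm_neon_ok } */
/* { dg-require-effective-target vect_hw_misalign } */
/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize -fdump-tree-vect-all" } */
/* { dg-add-options arm_neon } */
```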