https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #38 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #37)
> Even more iteration looks bad.  I do wonder why, when gather can avoid
> peeling for GAPs, using load-lanes cannot?
Like you say, we don't realise that all the loads from array3[i] form a single
group.

Note that we're not using load-lanes in either case, since the group size (8)
is too big for that.  But load-lanes and load-and-permute have the same
restriction about when peeling for gaps is required.

In contrast, gather loads only ever load data that they actually need.
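
To make the difference concrete, here's a minimal sketch (my own
example, not the PR's testcase): each iteration reads 2 out of every 8
consecutive elements, i.e. a grouped access of size 8 with gaps.
Load-lanes and load-and-permute both load whole groups, so the last
vector iteration can read beyond the data it actually needs and has to
be peeled off as scalar code.  A gather only loads a[i * 8] and
a[i * 8 + 1], so no peeling is required:

void
f (int *restrict out, int *restrict a, int n)
{
  for (int i = 0; i < n; i++)
    /* Group of 8 ints per iteration; only 2 of them are used.  */
    out[i] = a[i * 8] + a[i * 8 + 1];
}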

> Also for the stores we seem to use elementwise stores rather than store-lanes.
What configuration are you trying?  The original report was about SVE, so I was
trying that.  There we use a scatter store.
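
For reference, here's a minimal sketch of the kind of store involved
(my own example, not the PR's testcase): the stores are strided, so
with SVE they can be done as a scatter store with a vector of offsets,
whereas Advanced SIMD has no scatters and so has to fall back to
elementwise stores:

void
g (int (*restrict out)[16], int *restrict in, int n)
{
  for (int i = 0; i < n; i++)
    /* Stride of 16 ints between consecutive stores.  */
    out[i][0] = in[i];
}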

> To me the most obvious thing to try optimizing in this testcase is DR
> analysis.  With -march=armv8.3-a I still see
> 
> t.c:26:22: note:   === vect_analyze_data_ref_accesses ===
> t.c:26:22: note:   Detected single element interleaving array1[0][_8] step 4
> t.c:26:22: note:   Detected single element interleaving array1[1][_8] step 4
> t.c:26:22: note:   Detected single element interleaving array1[2][_8] step 4
> t.c:26:22: note:   Detected single element interleaving array1[3][_8] step 4
> t.c:26:22: note:   Detected single element interleaving array1[0][_1] step 4
> t.c:26:22: note:   Detected single element interleaving array1[1][_1] step 4
> t.c:26:22: note:   Detected single element interleaving array1[2][_1] step 4
> t.c:26:22: note:   Detected single element interleaving array1[3][_1] step 4
> t.c:26:22: missed:   not consecutive access array2[_4][_8] = _69;
> t.c:26:22: note:   using strided accesses
> t.c:26:22: missed:   not consecutive access array2[_4][_1] = _67;
> t.c:26:22: note:   using strided accesses
> 
> so we don't figure
> 
> Creating dr for array1[0][_1]
>         base_address: &array1
>         offset from base address: (ssizetype) ((sizetype) (m_111 * 2) * 2)
>         constant offset from base address: 0
>         step: 4
>         base alignment: 16
>         base misalignment: 0
>         offset alignment: 4
>         step alignment: 4
>         base_object: array1
>         Access function 0: {m_111 * 2, +, 2}<nw>_4
>         Access function 1: 0
> Creating dr for array1[0][_8]
> analyze_innermost: success.
>         base_address: &array1
>         offset from base address: (ssizetype) ((sizetype) (m_111 * 2 + 1) * 2)
>         constant offset from base address: 0
>         step: 4
>         base alignment: 16
>         base misalignment: 0
>         offset alignment: 2
>         step alignment: 4
>         base_object: array1
>         Access function 0: {m_111 * 2 + 1, +, 2}<nw>_4
>         Access function 1: 0
> 
> belong to the same group (but the access functions tell us it worked out).
> Above we fail to split the + 1 out into the constant offset.
OK, but this is moving the question on to how we should optimise the
testcase for Advanced SIMD rather than SVE, and how we should optimise
the testcase in general, rather than simply recovering what we could do
before.  (SVE is only enabled for -march=armv9-a and above, in case
armv8.3-a was intended to enable SVE too.)
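
For the record, here's a reduced sketch of the accesses in the dump
above (the array type and bounds are my guesses from the step of 4
bytes): both loads advance by 2 shorts per iteration and differ only
by the constant 1, so in principle they form a single interleaving
group.  But because the + 1 stays in the variable offset rather than
being split out into the constant offset, the two DRs' offsets compare
unequal and vect_analyze_data_ref_accesses doesn't put them in the
same group:

short array1[4][256];

int
h (int m, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    /* Access functions {m*2, +, 2} and {m*2 + 1, +, 2}, as in the
       dump above.  */
    sum += array1[0][(m + i) * 2] + array1[0][(m + i) * 2 + 1];
  return sum;
}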
