https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021

--- Comment #21 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Bill Schmidt from comment #20)
> We still don't vectorize the original code example on Power.  It appears
> that this is being disabled because of an alignment issue.  The data
> references are being rejected by:
> 
> product.f:9:0: note: can't force alignment of ref: REALPART_EXPR
> <*a.0_24[_50]>
> 
> and similar for the other three DRs.  This happens due to this code in
> vect_compute_data_ref_alignment:
> 
>   if (base_alignment < TYPE_ALIGN (vectype))
>     {
>       /* Strip an inner MEM_REF to a bare decl if possible.  */
>       if (TREE_CODE (base) == MEM_REF
>           && integer_zerop (TREE_OPERAND (base, 1))
>           && TREE_CODE (TREE_OPERAND (base, 0)) == ADDR_EXPR)
>         base = TREE_OPERAND (TREE_OPERAND (base, 0), 0);
> 
>       if (!vect_can_force_dr_alignment_p (base, TYPE_ALIGN (vectype)))
>         {
>           if (dump_enabled_p ())
>             {
>               dump_printf_loc (MSG_NOTE, vect_location,
>                                "can't force alignment of ref: ");
>               dump_generic_expr (MSG_NOTE, TDF_SLIM, ref);
>               dump_printf (MSG_NOTE, "\n");
>             }
>           return true;
>         }
> 
> Here TYPE_ALIGN (vectype) is 128 (Power vectors are normally aligned on a
> 128-bit value), and base_alignment is 64.  a.0 is defined as:
> 
> complex(kind=8) [0:D.1831] * restrict a.0;
> 
> In both ELFv1 and ELFv2 ABIs for Power, a complex type is defined to have
> the same alignment as the underlying type.  So "complex double" has 8-byte
> alignment.
> 
> On earlier versions of Power, the decision is fine, because unaligned
> accesses are expensive prior to POWER8.  With POWER8, though, an unaligned
> access will (most of the time) perform as well as an aligned access.  So
> ideally we would like to teach the vectorizer to allow vectorization here.
> 
> It seems like vect_supportable_dr_alignment ought to be considered as part
> of the SLP vectorization decision here, rather than just comparing the base
> alignment with the vector type alignment.  Adding a check for that allows
> things to get a little further, but we still don't vectorize the block.  (I
> haven't yet looked into why, but I assume more needs to be done downstream
> to handle this case.)
> 
> My understanding of the vectorizer is not yet very deep, so before going too
> far down the wrong path, I'd like your opinion on the best approach to
> fixing the problem.  Thanks!

I see it only failing due to cost issues (tried ppc64le and -mcpu=power8).
The unaligned loads cost 3 and we end up with

t.f90:8:0: note: Cost model analysis:
  Vector inside of loop cost: 40
  Vector prologue cost: 8
  Vector epilogue cost: 4
  Scalar iteration cost: 12
  Scalar outside cost: 6
  Vector outside cost: 12
  prologue iterations: 0
  epilogue iterations: 0
t.f90:8:0: note: cost model: the vector iteration cost = 40 divided by the
scalar iteration cost = 12 is greater or equal to the vectorization factor = 1.

Note that we are (still) not very good at estimating the SLP cost: we
account for 4 vector loads here (because we essentially end up with
4 different permutations being used), so the "unaligned" part is
over-counted, and likely the permutation cost as well.  Both are
limitations of the SLP data structures and not easily fixable.  With
-fvect-cost-model=unlimited I see both loops vectorized.
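
For reference, the comparison reported in the dump above can be sketched
roughly as below.  This is only an illustration of the arithmetic, assuming
the check is "scalar iteration cost times vectorization factor must exceed
the vector inside-of-loop cost"; the function name and signature are
hypothetical and not the vectorizer's actual code.

```c
#include <stdbool.h>

/* Hypothetical helper (not a GCC function): returns true when one vector
   iteration is cheaper than the VF scalar iterations it replaces.
   Assumption: this mirrors the condition behind the dump message
   "vector iteration cost divided by the scalar iteration cost is
   greater or equal to the vectorization factor".  */
static bool
vector_iteration_profitable (int vec_inside_cost, int scalar_iter_cost,
                             int vf)
{
  /* Multiply instead of dividing to avoid integer truncation.  */
  return scalar_iter_cost * vf > vec_inside_cost;
}
```

With the numbers from the dump (vector inside cost 40, scalar iteration
cost 12, vectorization factor 1) this gives 12 * 1 = 12, which is not
greater than 40, so the vectorization is rejected as unprofitable.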

> Bill
