https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021
--- Comment #21 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Bill Schmidt from comment #20)
> We still don't vectorize the original code example on Power.  It appears
> that this is being disabled because of an alignment issue.  The data
> references are being rejected by:
>
> product.f:9:0: note: can't force alignment of ref: REALPART_EXPR
> <*a.0_24[_50]>
>
> and similar for the other three DRs.  This happens due to this code in
> vect_compute_data_ref_alignment:
>
>   if (base_alignment < TYPE_ALIGN (vectype))
>     {
>       /* Strip an inner MEM_REF to a bare decl if possible.  */
>       if (TREE_CODE (base) == MEM_REF
>           && integer_zerop (TREE_OPERAND (base, 1))
>           && TREE_CODE (TREE_OPERAND (base, 0)) == ADDR_EXPR)
>         base = TREE_OPERAND (TREE_OPERAND (base, 0), 0);
>
>       if (!vect_can_force_dr_alignment_p (base, TYPE_ALIGN (vectype)))
>         {
>           if (dump_enabled_p ())
>             {
>               dump_printf_loc (MSG_NOTE, vect_location,
>                                "can't force alignment of ref: ");
>               dump_generic_expr (MSG_NOTE, TDF_SLIM, ref);
>               dump_printf (MSG_NOTE, "\n");
>             }
>           return true;
>         }
>
> Here TYPE_ALIGN (vectype) is 128 (Power vectors are normally aligned on a
> 128-bit value), and base_alignment is 64.  a.0 is defined as:
>
>   complex(kind=8) [0:D.1831] * restrict a.0;
>
> In both ELFv1 and ELFv2 ABIs for Power, a complex type is defined to have
> the same alignment as the underlying type.  So "complex double" has 8-byte
> alignment.
>
> On earlier versions of Power, the decision is fine, because unaligned
> accesses are expensive prior to POWER8.  With POWER8, though, an unaligned
> access will (most of the time) perform as well as an aligned access.  So
> ideally we would like to teach the vectorizer to allow vectorization here.
>
> It seems like vect_supportable_dr_alignment ought to be considered as part
> of the SLP vectorization decision here, rather than just comparing the base
> alignment with the vector type alignment.  Adding a check for that allows
> things to get a little further, but we still don't vectorize the block.  (I
> haven't yet looked into why, but I assume more needs to be done downstream
> to handle this case.)
>
> My understanding of the vectorizer is not yet very deep, so before going too
> far down the wrong path, I'd like your opinion on the best approach to
> fixing the problem.  Thanks!

I see it only failing due to cost issues (tried ppc64le and -mcpu=power8).
The unaligned loads cost 3 and we end up with

t.f90:8:0: note: Cost model analysis:
  Vector inside of loop cost: 40
  Vector prologue cost: 8
  Vector epilogue cost: 4
  Scalar iteration cost: 12
  Scalar outside cost: 6
  Vector outside cost: 12
  prologue iterations: 0
  epilogue iterations: 0
t.f90:8:0: note: cost model: the vector iteration cost = 40 divided by the
scalar iteration cost = 12 is greater or equal to the vectorization factor = 1.

Note that we are (still) not very good in estimating the SLP cost as we
account 4 vector loads here (because we essentially will end up with
4 different permutations used), so the "unaligned" part is accounted for
too much and likely the permutation cost as well.  Both are a limitation
of the SLP data structures and not easily fixable.

With -fvect-cost-model=unlimited I see both loops vectorized.

> Bill
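
For readers following along, the rejection in the dump above comes down to a simple comparison. The following is only a minimal sketch of that comparison, not the actual vectorizer code: the function name vector_iteration_not_profitable and its parameter names are made up for illustration, and the real cost model also accounts for prologue, epilogue and outside-of-loop costs.

```c
#include <stdbool.h>

/* Hypothetical helper, not a GCC function.  Mirrors the message
   "the vector iteration cost divided by the scalar iteration cost is
   greater or equal to the vectorization factor": one vector iteration
   replaces VF scalar iterations, so it only pays off when it is
   cheaper than VF scalar iterations.  */
bool
vector_iteration_not_profitable (int vec_inside_cost,
                                 int scalar_iter_cost,
                                 int vf)
{
  /* Written as a multiplication to avoid integer-division rounding.  */
  return vec_inside_cost >= scalar_iter_cost * vf;
}
```

With the numbers from the dump (vector cost 40, scalar cost 12, vectorization factor 1) this returns true, since 40 >= 12 * 1, which matches the vectorizer giving up here.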