Re: [RFC][PATCH] AArch64: Remove AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS

Richard Sandiford Wed, 28 Aug 2024 02:04:50 -0700

Tamar Christina <tamar.christ...@arm.com> writes:
>> >>>> The patch was bootstrapped and regtested on aarch64-linux-gnu:
>> >>>> No problems bootstrapping, but several test files (in aarch64-sve.exp:
>> >>>> gather_load_extend_X.c
>> >>>> where X is 1 to 4, strided_load_2.c, strided_store_2.c) fail because of 
>> >>>> small
>> >>>> differences
>> >>>> in codegen that make some of the scan-assembler-times tests fail.
>> >>>>
>> >>>> Kyrill suggested to add a -fvect-cost-model=unlimited flag to these 
>> >>>> tests and
>> >> add
>> >>>> some
>> >>>
>> >>> I don't personally like unlimited here as unlimited means just vectorize 
>> >>> at any
>> >>> cost.  This means that costing between modes are also disabled. A lot of 
>> >>> these
>> >>> testcases are intended to test not just that we vectorize but that we 
>> >>> vectorize
>> >>> with efficient code.
>> >>>
>> >>> I'd prefer to use -fvect-cost-model=dynamic if that fixes the testcases.
>> >>
>> >> Yeah, I don't think -fvect-cost-model=unlimited would work for the
>> >> gather_load_extend_X.c tests, since we require the cost model to decide
>> >> when to use extending loads vs loading a full vector and unpacking.
>>
>> I had suggested using -fvect-cost-model=unlimited here because I thought 
>> these
>> tests wanted to test the capability of GCC to detect and generate the 
>> particular SVE
>> feature (gather load extends) for all supported data types regardless of
>> profitability.
>
> I think the problem only specifically for the load_extend tests, because 
> unlimited also
> disables SVE vs SVE comparisons wrt to changes to VF, so not just Adv. SIMD 
> vs SVE.
> That is while it would generate a gather, it may choose to instead of gather 
> from
> B -> D, do B -> S and then extend the results.  Because this has a higher VF.
>
> Without the cross SVE mode comparisons it wouldn't know that the extensions 
> would
> actually slow it down.


Right.

>> If the tests are intended to also make the right profitability decisions for 
>> the generic
>> tuning model then I agree using -fvect-cost-model=unlimited is not 
>> appropriate
>> here, though I do think that it’d be useful to fix the backend vector cost 
>> model
>> hooks to honor -fvect-cost-model=unlimited and not limit generation of
>> gather/scatter in that case. What it should do for the SVE vs Neon decisions 
>> is an
>> open question.

FTR, most of the tests are instead testing specific code generation
strategies, regardless of whether those strategies are good for
particular real cores.  So forcing the choice is ok in principle.
It's just that -fvect-cost-model=unlimited wouldn't do that in this case.

>> >> [...tries patch...]
>> >>
>> >> It seems like three main things are contributing to the difference:
>> >>
>> >> 1. we now cost 0 for a scalar prologue extension from a loaded value
>> >> 2. we now cost gathers & scatters based on gather_load_x32/x64_cost
>> >> 3. we now apply a large one-off cost for gathers (Tamar's recent change)
>> >>
>> >> (1) is expected.
>> >>
>> >> (2) partly looks like a latent bug.  We're using the x32 cost for
>> >> VNx2QI->VNx2SI, even though those are really .B->.D loads.
>> >>
>> >> @@ -16819,7 +16811,7 @@ aarch64_detect_vector_stmt_subtype (vec_info
>> >> *vinfo, vect_cost_for_stmt kind,
>> >>       && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) ==
>> >> VMAT_GATHER_SCATTER)
>> >>     {
>> >>       unsigned int nunits = vect_nunits_for_cost (vectype);
>> >> -      if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
>> >> +      if (known_eq (GET_MODE_NUNITS (TYPE_MODE (vectype)),
>> >> aarch64_sve_vg))
>> >>      return { sve_costs->gather_load_x64_cost, nunits };
>> >>       return { sve_costs->gather_load_x32_cost, nunits };
>> >>     }
>> >>
>> >> fixes that.
>>
>> Would you be willing to test and push that to trunk to get it out of the way?

Sure.

>> >> (3) is interesting.  generic_vector_cost isn't used by default for any
>> >> SVE CPU, or any -march=...+sve.  So the question is: should we treat it
>> >> as "architectural intent"/somewhat idealised?  Or should we try to make
>> >> it produce good code for existing SVE cores, in which case it would
>> >> overlap quite a lot with generic_armv8_a and generic_armv9_a.
>>
>> >>
>> >> If the former, we could set gather_load_x32_init_cost and
>> >> gather_load_x64_init_cost to 0 for generic_sve_vector_cost
>> >> (and nothing else, so that generic_armv*_a are unaffected).
>>
>> I don’t have strong opinions on this point but if Tamar is concerned about
>> deviating too much from the known-good Advanced SIMD generic tuning we have
>> now then we should aim for minimal codegen changes in that?
>
> No I'm fine with Richard's suggestion.  The only way you'd be able to get 
> this model
> to fire is with -march=xxx=sve -mtune=generic, at which point I'm fine with 
> assuming
> gathers have no initial overhead.

OK, great.  Should I do this as well?  I suppose it's a separate patch
from removing the "new cost model" flag.

Richard

Re: [RFC][PATCH] AArch64: Remove AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS

Reply via email to