SVE uses VECT_COMPARE_COSTS to tell the vectoriser to try as many
variations as it knows about and pick the one with the lowest cost.
This serves two purposes:

(1) It means we can compare SVE loops that operate on packed vectors
    with SVE loops that operate on unpacked vectors.

(2) It means that we can compare SVE with Advanced SIMD.

Although we used VECT_COMPARE_COSTS for both of these purposes from the
outset, the focus initially was more on (1).  Adding VECT_COMPARE_COSTS
allowed us to use SVE extending loads and truncating stores, in which
loads and stores effectively operate on unpacked rather than packed
vectors.  This part seems to work pretty well in practice.
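
For reference, here is a much-simplified sketch of how a target opts in
to this behaviour.  The hook, the flag and the mode names are the real
target-independent and AArch64 ones, but the body is only illustrative
and is not a copy of the aarch64 implementation:

  /* Illustrative sketch: advertise both SVE and Advanced SIMD modes and
     ask the vectoriser to cost each candidate rather than stopping at
     the first one that works.  */
  static unsigned int
  example_autovectorize_vector_modes (vector_modes *modes, bool)
  {
    if (TARGET_SVE)
      modes->safe_push (VNx16QImode);   /* length-agnostic SVE */
    modes->safe_push (V16QImode);       /* 128-bit Advanced SIMD */
    modes->safe_push (V8QImode);        /* 64-bit Advanced SIMD */
    return VECT_COMPARE_COSTS;
  }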

However, it turns out that the second part (Advanced SIMD vs. SVE)
is less reliable.  There are three main reasons for this:

* At the moment, the AArch64 vector cost structures stick rigidly to the
  vect_cost_for_stmt enumeration provided by target-independent code.
  This particularly affects vec_to_scalar, which is used for at least:

  - reductions
  - extracting an element from a vector to do scalar arithmetic
  - extracting an element to store it out

  The vectoriser gives us the information we need to distinguish
  these cases, but the port wasn't using it (the first sketch after
  this list shows one way of doing so).  Other problems include
  undercosting LD[234] and ST[234] instructions and scatter stores.

* Currently, the vectoriser costing works by adding up what are typically
  latency values.  As Richi mentioned recently in an x86 context,
  this effectively means that we treat the scalar and vector code
  as executing serially.  That already causes some problems for
  Advanced SIMD vs. scalar code, but it turns out to be a particular
  problem when comparing SVE with Advanced SIMD.  Scalar, Advanced
  SIMD and SVE code can have significantly different issue
  characteristics, and summing latencies misses some important
  details, especially in loops involving reductions (the second
  sketch after this list gives a toy illustration).

* Advanced SIMD code can be completely unrolled at compile time,
  but length-agnostic SVE code can't.  We weren't taking this into
  account when comparing the costs.
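
To illustrate the first problem: the stmt_vec_info that the vectoriser
passes to the costing hooks already carries enough information to tell
the vec_to_scalar cases apart.  The macros below are the real ones from
the vectoriser headers, but the function and its cost numbers are only a
sketch, not the code from the patches:

  /* Sketch only: give the three main uses of vec_to_scalar different
     costs.  The numbers are placeholders, not tuned values.  */
  static unsigned
  example_vec_to_scalar_cost (stmt_vec_info stmt_info)
  {
    /* Reductions: the scalar result comes from a cross-lane operation,
       typically the most expensive of the three cases.  */
    if (stmt_info && STMT_VINFO_REDUC_DEF (stmt_info))
      return 4;

    /* Extracting an element in order to store it out.  */
    if (stmt_info
	&& STMT_VINFO_DATA_REF (stmt_info)
	&& DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
      return 1;

    /* Extracting an element to do scalar arithmetic on it.  */
    return 2;
  }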
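
And a toy illustration of the second problem.  Suppose an Advanced SIMD
loop body needs 8 vector arithmetic operations and 4 memory operations,
the equivalent SVE body needs 5 and 2, and the core can issue 2 of each
kind per cycle.  All of these numbers are made up for the example; the
real patches model the issue information per CPU.  An issue-limited
estimate looks like:

  /* Toy model only: estimate the cycles per iteration implied by issue
     rates, instead of summing latencies.  All parameters are
     illustrative.  */
  static unsigned
  example_estimate_cycles (unsigned num_general_ops, unsigned num_mem_ops,
			   unsigned general_ops_per_cycle,
			   unsigned mem_ops_per_cycle)
  {
    /* Round up: a partly-used cycle still takes a cycle.  */
    unsigned general_cycles
      = (num_general_ops + general_ops_per_cycle - 1) / general_ops_per_cycle;
    unsigned mem_cycles
      = (num_mem_ops + mem_ops_per_cycle - 1) / mem_ops_per_cycle;

    /* The loop is limited by its most heavily used resource, not by the
       sum of the individual latencies.  */
    return general_cycles > mem_cycles ? general_cycles : mem_cycles;
  }

With those numbers the Advanced SIMD body comes out at 4 cycles per
iteration and the SVE body at 3, whereas a plain latency sum can easily
point the other way once reduction latencies are added in.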

This series of patches tries to address these problems by making
some opt-in tweaks to the vector cost model.  It produces much better
results on the SVE workloads that we've tried internally.  We'd therefore
like to put this in for GCC 11.

I'm really sorry that this is landing so late in stage 4.  Clearly it
would have been much better to do this earlier.  However:

- The patches “only” change the vector cost hooks.  There are no changes
  elsewhere.  In other words, the SVE code we generate and the Advanced
  SIMD code we generate are unchanged: the “only” thing we're doing is
  using different heuristics to select between them.

- As mentioned above, almost all the new code is “opt-in”.  Therefore,
  only CPUs that explicitly want it (and will benefit from it) will be
  affected.  Most of the code is not executed otherwise.
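
  Concretely, “opt-in” means something like the sketch below: the new
  paths are keyed off an extra per-CPU tuning flag, so CPUs whose tuning
  structures don't set it keep the old behaviour.  aarch64_tune_params
  and extra_tuning_flags are the existing aarch64 tuning structures; the
  flag name is meant as an illustration of the mechanism:

    /* Sketch only: gate the new costing code on an extra tuning flag.  */
    static bool
    example_use_new_vector_costs_p (void)
    {
      return (aarch64_tune_params.extra_tuning_flags
	      & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS) != 0;
    }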

Tested on aarch64-linux-gnu (with and without SVE), pushed to trunk.

Richard
