SVE uses VECT_COMPARE_COSTS to tell the vectoriser to try as many variations as it knows and pick the one with the lowest cost. This serves two purposes:
(1) It means we can compare SVE loops that operate on packed vectors with SVE loops that operate on unpacked vectors.

(2) It means that we can compare SVE with Advanced SIMD.

Although we used VECT_COMPARE_COSTS for both of these purposes from the outset, the focus initially was more on (1). Adding VECT_COMPARE_COSTS allowed us to use SVE extending loads and truncating stores, in which loads and stores effectively operate on unpacked rather than packed vectors. This part seems to work pretty well in practice.

However, it turns out that the second part (Advanced SIMD vs. SVE) is less reliable. There are three main reasons for this:

* At the moment, the AArch64 vector cost structures stick rigidly to the vect_cost_for_stmt enumeration provided by target-independent code. This particularly affects vec_to_scalar, which is used for at least:

  - reductions
  - extracting an element from a vector to do scalar arithmetic
  - extracting an element to store it out

  The vectoriser gives us the information we need to distinguish these cases, but the port wasn't using it. Other problems include undercosting LD[234] and ST[234] instructions and scatter stores.

* Currently, the vectoriser costing works by adding up what are typically latency values. As Richi mentioned recently in an x86 context, this effectively means that we treat the scalar and vector code as executing serially. That already causes some problems for Advanced SIMD vs. scalar code, but it turns out to be a particular problem when comparing SVE with Advanced SIMD. Scalar, Advanced SIMD and SVE can have significantly different issue characteristics, and summing latencies misses some important details, especially in loops involving reductions.

* Advanced SIMD code can be completely unrolled at compile time, but length-agnostic SVE code can't. We weren't taking this into account when comparing the costs.

This series of patches tries to address these problems by making some opt-in tweaks to the vector cost model.
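To make the vec_to_scalar point concrete, here is a sketch of the kinds of loop kernels that can each incur a vec_to_scalar cost. The kernels and function names are my own illustrations, not code from the patch series, and whether each one actually scalarises in a given build depends on the target and options.

```c
#include <stddef.h>

/* Illustrative kernels only: three situations in which the vectoriser
   may charge a vec_to_scalar cost, even though the underlying
   operations (and their real costs on AArch64) differ considerably.  */

/* 1. A reduction: the per-lane partial sums end up in a vector
      register and the final result must be transferred to a scalar
      register.  */
int
sum_reduction (const int *a, size_t n)
{
  int sum = 0;
  for (size_t i = 0; i < n; ++i)
    sum += a[i];
  return sum;
}

/* 2. Extracting elements to do scalar arithmetic: Advanced SIMD has
      no vector integer division, so dividing by a runtime value may
      be done by extracting each lane, dividing in scalar registers,
      and inserting the result back.  */
void
divide_by (int *a, const int *b, int c, size_t n)
{
  for (size_t i = 0; i < n; ++i)
    a[i] = b[i] / c;
}

/* 3. Extracting elements in order to store them: a strided store has
      no single contiguous vector store, so lanes may be extracted and
      stored one at a time.  */
void
strided_store (int *a, const int *b, size_t n, size_t stride)
{
  for (size_t i = 0; i < n; ++i)
    a[i * stride] = b[i];
}
```

Charging one flat vec_to_scalar cost for all three hides large differences in the generated code, which is why distinguishing them matters.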
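The difference between summing latencies and modelling issue rates can be shown with a toy calculation. All of the numbers and field names below are invented purely for illustration; they are not real Advanced SIMD or SVE figures and not the model the patches implement.

```c
/* Toy cost-model comparison (invented numbers, purely illustrative):
   summing latencies treats each iteration as executing serially,
   while an issue-rate model asks how many cycles the operations need
   to flow through the vector pipes when iterations overlap.  The two
   models can rank candidate loop bodies differently.  */

struct loop_body
{
  const char *name;
  int latency_sum;   /* summed instruction latencies per iteration */
  int vector_ops;    /* ops competing for the vector pipes */
  int vector_pipes;  /* ops that can issue per cycle */
  int elts_per_iter; /* elements processed per iteration */
};

/* Cost per element if we assume serial execution (latency model).  */
double
latency_cost (const struct loop_body *b)
{
  return (double) b->latency_sum / b->elts_per_iter;
}

/* Cost per element if throughput is limited by issue pressure.  */
double
issue_cost (const struct loop_body *b)
{
  return ((double) b->vector_ops / b->vector_pipes) / b->elts_per_iter;
}
```

With made-up figures of { latency_sum 10, vector_ops 6, vector_pipes 2, elts_per_iter 4 } for an Advanced SIMD body and { 24, 4, 2, 8 } for an SVE body, the latency model favours Advanced SIMD (2.5 vs. 3.0 cycles per element) while the issue model favours SVE (0.75 vs. 0.25), showing how the two heuristics can disagree.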
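On the unrolling point, a small example (again my own illustration, not from the patches) of the kind of loop where the asymmetry shows up:

```c
/* Illustrative only: with a known trip count of 16 and 128-bit
   vectors, an Advanced SIMD version of this loop (4 ints per vector,
   4 vector iterations) can be fully unrolled into straight-line code
   with no loop control left.  Length-agnostic SVE code cannot be
   unrolled this way, because the number of iterations depends on the
   runtime vector length.  A cost comparison that ignores the saved
   loop overhead will overrate the SVE version.  */
void
add_16 (int *restrict a, const int *restrict b)
{
  for (int i = 0; i < 16; ++i)
    a[i] += b[i];
}
```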
It produces much better results on the SVE workloads that we've tried internally. We'd therefore like to put this in for GCC 11.

I'm really sorry that this is landing so late in stage 4. Clearly it would have been much better to do this earlier. However:

- The patches "only" change the vector cost hooks. There are no changes elsewhere. In other words, the SVE code we generate and the Advanced SIMD code we generate are unchanged: the "only" thing we're doing is using different heuristics to select between them.

- As mentioned above, almost all the new code is "opt-in". Therefore, only CPUs that explicitly want it (and will benefit from it) will be affected. Most of the code is not executed otherwise.

Tested on aarch64-linux-gnu (with and without SVE), pushed to trunk.

Richard