Richard Biener <rguent...@suse.de> writes:
> This tries to improve BB vectorization dumps by providing more
> precise locations.  Currently the vect_location is simply the
> very last stmt in a basic-block that has a location.  So for
>
> double a[4], b[4];
> int x[4], y[4];
> void foo()
> {
>   a[0] = b[0]; // line 5
>   a[1] = b[1];
>   a[2] = b[2];
>   a[3] = b[3];
>   x[0] = y[0]; // line 9
>   x[1] = y[1];
>   x[2] = y[2];
>   x[3] = y[3];
> } // line 13
>
> we show the user with -O3 -fopt-info-vec
>
> t.c:13:1: optimized: basic block part vectorized using 16 byte vectors
>
> while with the patch we point to both independently vectorized
> opportunities:
>
> t.c:5:8: optimized: basic block part vectorized using 16 byte vectors
> t.c:9:8: optimized: basic block part vectorized using 16 byte vectors
>
> there's the possibility that the location regresses in case the
> root stmt in the SLP instance has no location.  For a SLP subgraph
> with multiple entries the location also chooses one entry at random,
> not sure in which case we want to dump both.
>
> Still as the plan is to extend the basic-block vectorization
> scope from single basic-block to multiple ones this is a first
> step to preserve something sensible.
>
> Implementation-wise this makes both costing and code-generation
> happen on the subgraphs as analyzed.
>
> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>
> Richard - is iteration over vector modes for BB vectorization
> still important now that we have related_vector_type and thus
> no longer only consider a fixed size?  If so it will probably
> make sense to somehow still iterate even if there was some
> SLP subgraph vectorized?  It also looks like BB vectorization
> was never updated to consider multiple modes based on cost,
> it will still pick the first opportunity.  For BB vectorization
> we also have the code that re-tries SLP discovery with
> splitting the store group.  So what's your overall thoughts to
> this?
I think there might be different answers for “in principle” and
“in practice”. :-)

In principle, there's no one right answer to (say) “what vector mode
should I use for 4 32-bit integers?”.  If the block is only operating
on that type, then VNx4SI is the right choice for 128-bit SVE.  But if
the block is mostly operating on 4 64-bit integers and just converting
to 32-bit integers for a small region, then it might be better to use
2 VNx2SIs instead (paired with 2 VNx2DIs).

In practice, one situation in which the current loop might be needed
is pattern statements.  There we assign a vector type during pattern
recognition, based only on the element type.  So in that situation,
the first pass (with the autodetected base vector mode) will not take
the number of scalar stmts into account.

Also, although SLP currently only operates on full vectors, I was
hoping we would eventually support predication for SLP too.  At that
point, the number of scalar statements wouldn't directly determine
the number of vector lanes.

On the cost thing: it would be better to try all four and pick the
one with the lowest cost, but given your in-progress changes, it
seemed like a dead end to do that with the current code.  It sounded
like the general direction here was to build an SLP graph and “solve”
the vector type assignment problem in a more global way, once we have
a view of the entire graph.  Is that right?  If so, then at that
point we might be able to do something more intelligent than just
iterate over all the options.  (Although at the same time, iterating
over all the options on a fixed (sub?)graph would be cheaper than
what we do now.)

Thanks,
Richard