Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Thu, 30 Sep 2021, Andre Vieira (lists) wrote:
>
>> Hi,
>>
>> >> That just forces trying the vector modes we've tried before. Though I
>> >> might need to revisit this now I think about it. I'm afraid it might
>> >> be possible for this to generate an epilogue with a vf that is not
>> >> lower than that of the main loop, but I'd need to think about this
>> >> again.
>> >>
>> >> Either way I don't think this changes the vector modes used for the
>> >> epilogue. But maybe I'm just missing your point here.
>> > Yes, I was referring to the above which suggests that when we vectorize
>> > the main loop with V4SF but unroll then we try vectorizing the
>> > epilogue with V4SF as well (but not unrolled). I think that's
>> > premature (not sure if you try V8SF if the main loop was V4SF but
>> > unrolled 4 times).
>>
>> My main motivation for this was because I had a SVE loop that vectorized with
>> both VNx8HI, then V8HI which beat VNx8HI on cost, then it decided to unroll
>> V8HI by two and skipped using VNx8HI as a predicated epilogue which would've
>> been the best choice.
>
> I see, yes - for fully predicated epilogues it makes sense to consider
> the same vector mode as for the main loop anyways (independent on
> whether we're unrolling or not). One could argue that with an
> unrolled V4SImode main loop a predicated V8SImode epilogue would also
> be a good match (but then somehow costing favored the unrolled V4SI
> over the V8SI for the main loop...).
>
>> So that is why I decided to just 'reset' the vector_mode selection. In a
>> scenario where you only have the traditional vector modes it might make less
>> sense.
>>
>> Just realized I still didn't add any check to make sure the epilogue has a
>> lower VF than the previous loop, though I'm still not sure that could happen.
>> I'll go look at where to add that if you agree with this.
>
> As said above, it only needs a lower VF in case the epilogue is not
> fully masked - otherwise the same VF would be OK.
>
>> >> I can move it there, it would indeed remove the need for the change to
>> >> vect_update_vf_for_slp, the change to
>> >> vect_determine_partial_vectors_and_peeling would still be required I
>> >> think. It is meant to disable using partial vectors in an unrolled loop.
>> > Why would we disable the use of partial vectors in an unrolled loop?
>> The motivation behind that is that the overhead caused by generating
>> predicates for each iteration will likely be too much for it to be profitable
>> to unroll. On top of that, when dealing with low iteration count loops, if
>> executing one predicated iteration would be enough we now still need to
>> execute all other unrolled predicated iterations, whereas if we keep them
>> unrolled we skip the unrolled loops.
>
> OK, I guess we're not factoring in costs when deciding on predication
> but go for it if it's generally enabled and possible.

Yeah. That's mostly by design in SVE's case, but I can see that it
might need tweaking for other targets.

I don't think that's really the “problem” (if it is a problem) for
the unroll decision though. The “correct” way to unroll SVE loops, which
we're hoping to add at some point, is to predicate only the final vector
in each unrolled iteration. So if the unroll factor is 4, say, the
unrolled iterations would be:

  - unrolled 4x
    - unpredicated vector 0
    - unpredicated vector 1
    - unpredicated vector 2
    - predicated vector 3

The epilogue loop would then be predicated and execute at most 3 times.
Alternatively, there could be two non-looping epilogues:

  - unrolled 2x
    - unpredicated vector 0
    - predicated vector 1

  - not unrolled
    - predicated vector 0

This ensures that every executed vector operation does at least some
useful work.

Like Andre says, if we predicate as things stand, we'd have:

  - unrolled 4x
    - predicated vector 0
    - predicated vector 1
    - predicated vector 2
    - predicated vector 3

where the code could end up executing 4 vector operations per scalar
operation even if the final vector iteration only needs to process
2 elements. I don't think that's a costing issue, it's just building
in redundancy that doesn't need to be there.

Thanks,
Richard
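
To put rough numbers on the redundancy described above, here is a small
Python sketch. It is not GCC code and does not reflect how the vectorizer
is implemented. It assumes a fixed vector length `vl` in elements (real SVE
is length-agnostic), counts one "vector operation" per vector per scalar
statement, and assumes the unrolled body is entered only when its
unpredicated vectors would be full and its predicated vector has at least
one active element. The function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def ops_fully_predicated(n, vl, unroll):
    """Every vector in the unrolled body is predicated and there is no
    epilogue: each loop iteration always executes `unroll` vector ops."""
    return ceil_div(n, unroll * vl) * unroll


def ops_final_vector_predicated(n, vl, unroll):
    """Only the last vector of each unrolled iteration is predicated;
    a predicated, non-unrolled epilogue loop handles what is left."""
    ops = 0
    remaining = n
    # Assumed entry condition: the (unroll - 1) unpredicated vectors are
    # full and the final predicated vector has at least one element.
    while remaining > (unroll - 1) * vl:
        remaining -= min(remaining, unroll * vl)
        ops += unroll
    epilogue_iters = ceil_div(remaining, vl)
    # Matches "execute at most 3 times" for an unroll factor of 4.
    assert epilogue_iters <= max(unroll - 1, 0)
    return ops + epilogue_iters


def useful_ops(n, vl):
    """Minimum number of vector ops that touch at least one element."""
    return ceil_div(n, vl)


if __name__ == "__main__":
    vl, unroll = 8, 4
    for n in (2, 24, 25, 34):
        print(n,
              ops_fully_predicated(n, vl, unroll),
              ops_final_vector_predicated(n, vl, unroll),
              useful_ops(n, vl))
```

With `vl = 8` and `unroll = 4`, a loop over 2 elements costs 4 vector
operations in the fully predicated form but only 1 when just the final
vector is predicated. Under the assumptions above, the second form always
matches `useful_ops`.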