Hi Tamar,

on 2020/3/10 7:31 PM, Tamar Christina wrote:
>
>> -----Original Message-----
>> From: Gcc <gcc-boun...@gcc.gnu.org> On Behalf Of Richard Biener
>> Sent: Tuesday, March 10, 2020 11:12 AM
>> To: Kewen.Lin <li...@linux.ibm.com>
>> Cc: GCC Development <gcc@gcc.gnu.org>; Segher Boessenkool
>> <seg...@kernel.crashing.org>
>> Subject: Re: How to extend SLP to support this case
>>
>> On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin <li...@linux.ibm.com> wrote:
>>>
>>> Hi all,
>>>
>>> But how to teach it to be aware of this? Currently the processing
>>> starts from bottom to up (from stores), can we do some analysis on the
>>> SLP instance, detect some pattern and update the whole instance?
>>
>> In theory yes (Tamar had something like that for AARCH64 complex rotations
>> IIRC). And yes, the issue boils down to how we handle SLP discovery. I'd like
>> to improve SLP discovery but it's on my list only after I managed to get rid of
>> the non-SLP code paths. I have played with some ideas (even produced
>> hackish patches) to find "seeds" to form SLP groups from using multi-level
>> hashing of stmts.
>
> I still have this but missed the stage-1 deadline after doing the rewriting to C++
>
> We've also been looking at this and the approach I'm investigating now is trying to get
> the SLP codepath to handle this after it's been fully unrolled. I'm looking into whether
> the build-slp can be improved to work for the group size == 16 case that it tries but fails
> on.
>
Thanks! Glad to know you have been working on this!
Yes, I saw the standalone SLP pass finally split the group (16 store stmts).

> My intention is to see if doing so would make it simpler to recognize this as just 4 linear
> loads and two permutes. I think the loop aware SLP will have a much harder time with this
> seeing the load permutations it thinks it needs because of the permutes caused by the +/-
> pattern.

I may be missing something, so just to double check: do you mean making it 4 linear
loads for either of p1/p2? In the optimal vectorized version, p1 and p2 have 4
separate loads and a construction, then further permutations.

>
> One Idea I had before was from your comment on the complex number patch, which is to try
> and move up TWO_OPERATORS and undo the permute always when doing +/-. This would simplify
> the load permute handling and if a target doesn't have an instruction to support this it would just
> fall back to doing an explicit permute after the loads. But I wasn't sure this approach would get me the
> results I wanted.
>

IIUC, we have to look for either <a0, a1, a2, a3> or
<a0_iter0, a0_iter1, a0_iter2, a0_iter3> ..., since either can leverage the
isomorphic byte loads, subtraction, shift and addition.

I was thinking that the SLP pattern matcher could detect the pattern with two
levels of TWO_OPERATORS, one level with t/0,1,2,3/ and the other with
a/0,1,2,3/, as well as the dependent isomorphic computations for a/0,1,2,3/,
and transform it into isomorphic subtraction, int promotion, shift and addition.

> In the end you don't want a loop here at all. And in order to do the above with TWO_OPERATORS I would
> have to let the SLP pattern matcher be able to reduce the group size and increase the no# iterations during
> the matching otherwise the matching itself becomes quite difficult in certain cases..
>

OK, it sounds like it is unable to get the optimal one, which requires all 16
bytes (0-3 or 4-7 x 4 iterations).

BR,
Kewen
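
For reference, here is a minimal C sketch of the kind of row computation discussed above, showing how two levels of +/- (TWO_OPERATORS) can be written as isomorphic add/sub on permuted lanes followed by a blend. The kernel itself is not quoted in this message, so the exact expressions, the store order, and the helpers `row_scalar`, `add_sub_blend`, `row_lanes` and `check_row` are assumptions made purely for illustration. This is not GCC code, and it does not reflect how the vectorizer represents these nodes internally.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar form of one row. The names a0..a3 and t0..t3 follow the thread;
   the exact expressions and the store order are assumptions. */
void row_scalar(const uint8_t *p1, const uint8_t *p2, uint32_t out[4])
{
    /* Convert to unsigned before shifting so that a negative difference
       is never left-shifted as a signed int. */
    uint32_t a0 = (uint32_t)(p1[0] - p2[0]) + ((uint32_t)(p1[4] - p2[4]) << 16);
    uint32_t a1 = (uint32_t)(p1[1] - p2[1]) + ((uint32_t)(p1[5] - p2[5]) << 16);
    uint32_t a2 = (uint32_t)(p1[2] - p2[2]) + ((uint32_t)(p1[6] - p2[6]) << 16);
    uint32_t a3 = (uint32_t)(p1[3] - p2[3]) + ((uint32_t)(p1[7] - p2[7]) << 16);
    uint32_t t0 = a0 + a1, t1 = a0 - a1, t2 = a2 + a3, t3 = a2 - a3;
    out[0] = t0 + t2;
    out[1] = t1 + t3;
    out[2] = t0 - t2;
    out[3] = t1 - t3;
}

/* Hypothetical helper: one +/- level on 4 lanes. Both operands are
   permutes of the same input; the full sum and the full difference are
   computed, then one of them is selected per lane (the blend). */
void add_sub_blend(const uint32_t in[4], const int px[4], const int py[4],
                   const int use_sub[4], uint32_t out[4])
{
    uint32_t sum[4], diff[4];
    for (int i = 0; i < 4; i++) {
        sum[i] = in[px[i]] + in[py[i]];
        diff[i] = in[px[i]] - in[py[i]];
    }
    for (int i = 0; i < 4; i++)
        out[i] = use_sub[i] ? diff[i] : sum[i];
}

/* Lane-wise form: isomorphic byte loads, subtraction, shift and addition
   produce <a0, a1, a2, a3>, followed by two +/- levels. */
void row_lanes(const uint8_t *p1, const uint8_t *p2, uint32_t out[4])
{
    static const int x1[4] = {0, 0, 2, 2}, y1[4] = {1, 1, 3, 3};
    static const int s1[4] = {0, 1, 0, 1};
    static const int x2[4] = {0, 1, 0, 1}, y2[4] = {2, 3, 2, 3};
    static const int s2[4] = {0, 0, 1, 1};
    uint32_t a[4], t[4];

    for (int i = 0; i < 4; i++) {
        uint32_t lo = (uint32_t)(p1[i] - p2[i]);         /* bytes 0-3 */
        uint32_t hi = (uint32_t)(p1[i + 4] - p2[i + 4]); /* bytes 4-7 */
        a[i] = lo + (hi << 16);
    }
    add_sub_blend(a, x1, y1, s1, t);   /* t0..t3 from a0..a3 */
    add_sub_blend(t, x2, y2, s2, out); /* stored values from t0..t3 */
}

/* Asserts that both forms agree for one row of inputs. */
void check_row(const uint8_t *p1, const uint8_t *p2)
{
    uint32_t s[4], v[4];
    row_scalar(p1, p2, s);
    row_lanes(p1, p2, v);
    for (int i = 0; i < 4; i++)
        assert(s[i] == v[i]);
}
```

Calling `check_row` on any two 8-byte rows exercises both forms. The sketch covers a single row only; gathering all 16 bytes across 4 iterations, as mentioned above, is not modelled here.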