Hi Tamar,

on 2020/3/10 7:31 PM, Tamar Christina wrote:
> 
>> -----Original Message-----
>> From: Gcc <gcc-boun...@gcc.gnu.org> On Behalf Of Richard Biener
>> Sent: Tuesday, March 10, 2020 11:12 AM
>> To: Kewen.Lin <li...@linux.ibm.com>
>> Cc: GCC Development <gcc@gcc.gnu.org>; Segher Boessenkool
>> <seg...@kernel.crashing.org>
>> Subject: Re: How to extend SLP to support this case
>>
>> On Tue, Mar 10, 2020 at 7:52 AM Kewen.Lin <li...@linux.ibm.com> wrote:
>>>
>>> Hi all,
>>>
>>> But how to teach it to be aware of this? Currently the processing
>>> starts from bottom to up (from stores), can we do some analysis on the
>>> SLP instance, detect some pattern and update the whole instance?
>>
>> In theory yes (Tamar had something like that for AARCH64 complex rotations
>> IIRC).  And yes, the issue boils down to how we handle SLP discovery.  I'd
>> like to improve SLP discovery but it's on my list only after I managed to
>> get rid of the non-SLP code paths.  I have played with some ideas (even
>> produced hackish patches) to find "seeds" to form SLP groups from using
>> multi-level hashing of stmts.
> 
> I still have this but missed the stage-1 deadline after doing the rewriting
> to C++ 😊
> 
> We've also been looking at this and the approach I'm investigating now is
> trying to get the SLP codepath to handle this after it's been fully unrolled.
> I'm looking into whether the build-slp can be improved to work for the group
> size == 16 case that it tries but fails on.
> 

Thanks!  Glad to know you have been working on this!

Yes, I saw that the standalone SLP pass ends up splitting the group (16 store
stmts).

> My intention is to see if doing so would make it simpler to recognize this as
> just 4 linear loads and two permutes. I think the loop aware SLP will have a
> much harder time with this seeing the load permutations it thinks it needs
> because of the permutes caused by the +/- pattern.

I may be missing something, so just to double-check: do you mean turning the
accesses for each of p1/p2 into 4 linear loads?  In the optimal vectorized
version, p1 and p2 each need 4 separate loads plus a vector construction,
followed by further permutations.

> 
> One idea I had before was from your comment on the complex number patch,
> which is to try and move up TWO_OPERATORS and undo the permute always when
> doing +/-. This would simplify the load permute handling and if a target
> doesn't have an instruction to support this it would just fall back to doing
> an explicit permute after the loads.  But I wasn't sure this approach would
> get me the results I wanted.

IIUC, we have to look for either <a0, a1, a2, a3> or <a0_iter0, a0_iter1,
a0_iter2, a0_iter3> ..., since either grouping can leverage the isomorphic
byte loads, subtraction, shift and addition.
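
To make the two groupings concrete, here is a minimal C sketch.  The kernel
shape is my assumption about the case under discussion (byte inputs p1/p2,
with the difference of bytes 0-3 in the low half and the difference of bytes
4-7 shifted left by 16); the function names and the stride parameters i1/i2
are hypothetical and only serve the illustration.

```c
#include <stdint.h>

/* Assumed scalar definition of a_j for one row: the difference of byte j
   goes into the low half, the difference of byte j+4 into the high half. */
static uint32_t a_scalar(const uint8_t *p1, const uint8_t *p2, int j)
{
    uint32_t lo = (uint32_t)(p1[j] - p2[j]);
    uint32_t hi = (uint32_t)(p1[j + 4] - p2[j + 4]);
    return lo + (hi << 16);
}

/* Grouping 1: <a0, a1, a2, a3> taken from a single iteration (row). */
static void a_within_iter(const uint8_t *p1, const uint8_t *p2, uint32_t a[4])
{
    for (int j = 0; j < 4; j++)
        a[j] = a_scalar(p1, p2, j);
}

/* Grouping 2: <aj_iter0, aj_iter1, aj_iter2, aj_iter3> for a fixed j,
   where i1 and i2 are the (hypothetical) row strides of p1 and p2. */
static void a_across_iters(const uint8_t *p1, int i1,
                           const uint8_t *p2, int i2,
                           int j, uint32_t a[4])
{
    for (int it = 0; it < 4; it++)
        a[it] = a_scalar(p1 + it * i1, p2 + it * i2, j);
}
```

In both functions every lane runs the same load, subtract, shift and add
sequence; only the choice of which scalars share a vector differs.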

I was thinking that the SLP pattern matcher could detect the pattern with two
levels of TWO_OPERATORS, one level with t/0,1,2,3/ and the other with
a/0,1,2,3/, together with the dependent isomorphic computations for
a/0,1,2,3/, and transform it into isomorphic subtraction, int promotion,
shift and addition.
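
As a rough illustration of what I mean by the two levels, here is a small C
sketch.  It models each mixed +/- group as a full add and a full subtract on
permuted operands followed by a lane blend, which is only my simplified
picture of how such a node can be handled.  The lane order of the outputs and
all names below are assumptions for the example, not taken from the actual
source.

```c
#include <stdint.h>

/* Level over a/0,1,2,3/: t = { a0+a1, a0-a1, a2+a3, a2-a3 }. */
static void level_a(const uint32_t a[4], uint32_t t[4])
{
    /* Permuted operands so that all four lanes are isomorphic. */
    uint32_t lhs[4] = { a[0], a[0], a[2], a[2] };
    uint32_t rhs[4] = { a[1], a[1], a[3], a[3] };
    for (int k = 0; k < 4; k++) {
        uint32_t add = lhs[k] + rhs[k];
        uint32_t sub = lhs[k] - rhs[k];
        t[k] = (k & 1) ? sub : add;   /* blend: odd lanes take the minus */
    }
}

/* Level over t/0,1,2,3/: out = { t0+t2, t1+t3, t0-t2, t1-t3 }
   (assumed store order). */
static void level_t(const uint32_t t[4], uint32_t out[4])
{
    uint32_t lhs[4] = { t[0], t[1], t[0], t[1] };
    uint32_t rhs[4] = { t[2], t[3], t[2], t[3] };
    for (int k = 0; k < 4; k++) {
        uint32_t add = lhs[k] + rhs[k];
        uint32_t sub = lhs[k] - rhs[k];
        out[k] = (k < 2) ? add : sub; /* blend: upper lanes take the minus */
    }
}
```

The permutes feeding lhs/rhs are the part that makes the load permutations
look complicated when discovery walks up from the stores.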

> In the end you don't want a loop here at all. And in order to do the above
> with TWO_OPERATORS I would have to let the SLP pattern matcher be able to
> reduce the group size and increase the no# iterations during the matching
> otherwise the matching itself becomes quite difficult in certain cases..
> 

OK, so it sounds like that approach cannot reach the optimal version, which
requires all 16 bytes (bytes 0-3 or 4-7, times 4 iterations).

BR,
Kewen
