Re: [RFC] ldist: Recognize rawmemchr loop patterns

Richard Biener via Gcc-patches Wed, 05 May 2021 03:04:17 -0700

On Wed, May 5, 2021 at 11:36 AM Richard Biener
<richard.guent...@gmail.com> wrote:
>
> On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus
> <stefa...@linux.ibm.com> wrote:
> >
> > [snip]
> >
> > Please find attached a new version of the patch.  A major change compared to
> > the previous patch is that I created a separate pass which hopefully makes
> > reviewing also easier since it is almost self-contained.  After realizing 
> > that
> > detecting loops which mimic the behavior of rawmemchr/strlen functions does 
> > not
> > really fit into the topic of loop distribution, I created a separate pass.
>
> It's true that these reduction-like patterns are more difficult than
> the existing
> memcpy/memset cases.
>
> >  Due
> > to this I was also able to play around a bit and schedule the pass at 
> > different
> > times.  Currently it is scheduled right before loop distribution where loop
> > header copying already took place which leads to the following effect.
>
> In fact I'd schedule it after loop distribution so there's the chance that 
> loop
> distribution can expose a loop that fits the new pattern.
>
> >  Running
> > this setup over
> >
> > char *t (char *p)
> > {
> >   for (; *p; ++p);
> >   return p;
> > }
> >
> > the new pass transforms
> >
> > char * t (char * p)
> > {
> >   char _1;
> >   char _7;
> >
> >   <bb 2> [local count: 118111600]:
> >   _7 = *p_3(D);
> >   if (_7 != 0)
> >     goto <bb 5>; [89.00%]
> >   else
> >     goto <bb 7>; [11.00%]
> >
> >   <bb 5> [local count: 105119324]:
> >
> >   <bb 3> [local count: 955630225]:
> >   # p_8 = PHI <p_6(6), p_3(D)(5)>
> >   p_6 = p_8 + 1;
> >   _1 = *p_6;
> >   if (_1 != 0)
> >     goto <bb 6>; [89.00%]
> >   else
> >     goto <bb 8>; [11.00%]
> >
> >   <bb 8> [local count: 105119324]:
> >   # p_2 = PHI <p_6(3)>
> >   goto <bb 4>; [100.00%]
> >
> >   <bb 6> [local count: 850510901]:
> >   goto <bb 3>; [100.00%]
> >
> >   <bb 7> [local count: 12992276]:
> >
> >   <bb 4> [local count: 118111600]:
> >   # p_9 = PHI <p_2(8), p_3(D)(7)>
> >   return p_9;
> >
> > }
> >
> > into
> >
> > char * t (char * p)
> > {
> >   char * _5;
> >   char _7;
> >
> >   <bb 2> [local count: 118111600]:
> >   _7 = *p_3(D);
> >   if (_7 != 0)
> >     goto <bb 5>; [89.00%]
> >   else
> >     goto <bb 4>; [11.00%]
> >
> >   <bb 5> [local count: 105119324]:
> >   _5 = p_3(D) + 1;
> >   p_10 = .RAWMEMCHR (_5, 0);
> >
> >   <bb 4> [local count: 118111600]:
> >   # p_9 = PHI <p_10(5), p_3(D)(2)>
> >   return p_9;
> >
> > }
> >
> > which is fine so far.  However, I haven't made up my mind so far whether it 
> > is
> > worthwhile to spend more time in order to also eliminate the "first 
> > unrolling"
> > of the loop.
>
> Might be a phiopt transform ;)  Might apply to quite some set of
> builtins.  I wonder how the strlen case looks like though.
>
> > I gave it a shot by scheduling the pass prior pass copy header
> > and ended up with:
> >
> > char * t (char * p)
> > {
> >   <bb 2> [local count: 118111600]:
> >   p_5 = .RAWMEMCHR (p_3(D), 0);
> >   return p_5;
> >
> > }
> >
> > which seems optimal to me.  The downside of this is that I have to 
> > initialize
> > scalar evolution analysis which might be undesired that early.
> >
> > All this brings me to the question where do you see this peace of code 
> > running?
> > If in a separate pass when would you schedule it?  If in an existing pass,
> > which one would you choose?
>
> I think it still fits loop distribution.  If you manage to detect it
> with your pass
> standalone then you should be able to detect it in loop distribution.  Can you
> explain what part is "easier" as standalone pass?


Btw, another "fitting" pass would be final value replacement (pass_scev_cprop)
since what these patterns provide is a builtin call to compute the value of one
of the loop PHIs on exit.  Note this pass leaves removal of in-loop computations
to followup DCE which means that in some cases it does unprofitable transforms.
There's a bug somewhere where I worked on doing final value replacement
on-demand when DCE figures out the loop is otherwise dead but I never finished
this (loop distribution could also use such mechanism to get rid of
unwanted PHIs).

> > Another topic which came up is whether there exists a more elegant solution 
> > to
> > my current implementation in order to deal with stores (I'm speaking of the 
> > `if
> > (store_dr)` statement inside of function transform_loop_1).  For example,
> >
> > extern char *p;
> > char *t ()
> > {
> >   for (; *p; ++p);
> >   return p;
> > }
> >
> > ends up as
> >
> > char * t ()
> > {
> >   char * _1;
> >   char * _2;
> >   char _3;
> >   char * p.1_8;
> >   char _9;
> >   char * p.1_10;
> >   char * p.1_11;
> >
> >   <bb 2> [local count: 118111600]:
> >   p.1_8 = p;
> >   _9 = *p.1_8;
> >   if (_9 != 0)
> >     goto <bb 5>; [89.00%]
> >   else
> >     goto <bb 7>; [11.00%]
> >
> >   <bb 5> [local count: 105119324]:
> >
> >   <bb 3> [local count: 955630225]:
> >   # p.1_10 = PHI <_1(6), p.1_8(5)>
> >   _1 = p.1_10 + 1;
> >   p = _1;
> >   _3 = *_1;
> >   if (_3 != 0)
> >     goto <bb 6>; [89.00%]
> >   else
> >     goto <bb 8>; [11.00%]
> >
> >   <bb 8> [local count: 105119324]:
> >   # _2 = PHI <_1(3)>
> >   goto <bb 4>; [100.00%]
> >
> >   <bb 6> [local count: 850510901]:
> >   goto <bb 3>; [100.00%]
> >
> >   <bb 7> [local count: 12992276]:
> >
> >   <bb 4> [local count: 118111600]:
> >   # p.1_11 = PHI <_2(8), p.1_8(7)>
> >   return p.1_11;
> >
> > }
> >
> > where inside the loop a load and store occurs.  For a rawmemchr like loop I
> > have to show that we never load from a memory location to which we write.
> > Currently I solve this by hard coding those facts which are not generic at 
> > all.
> > I gave compute_data_dependences_for_loop a try which failed to determine the
> > fact that stores only happen to p[0] and loads from p[i] where i>0.  Maybe
> > there are more generic solutions to express this in contrast to my current 
> > one?
>
> So the example loop is not valid to be trasformed to rawmemchr since it's
> valid to call it with p = &p; - but sure, once you pass the first *p != 0 
> check
> things become fishy but GCC isn't able to turn that into a non-dependence.
>
> Why is the case of stores inside the loop important?  In fact that you
> think it is
> makes a case for integrating this with loop distribution since loop 
> distribution
> would be able to prove (if possible) that the stores can be separated into
> a different loop.
>
> And sorry for the delay in answering ...
>
> Thanks,
> Richard.
>
> >
> > Thanks again for your input so far.  Really appreciated.
> >
> > Cheers,
> > Stefan

Re: [RFC] ldist: Recognize rawmemchr loop patterns

Reply via email to