On Wed, May 5, 2021 at 11:36 AM Richard Biener <richard.guent...@gmail.com> wrote: > > On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus > <stefa...@linux.ibm.com> wrote: > > > > [snip] > > > > Please find attached a new version of the patch. A major change compared to > > the previous patch is that I created a separate pass which hopefully makes > > reviewing also easier since it is almost self-contained. After realizing > > that > > detecting loops which mimic the behavior of rawmemchr/strlen functions does > > not > > really fit into the topic of loop distribution, I created a separate pass. > > It's true that these reduction-like patterns are more difficult than > the existing > memcpy/memset cases. > > > Due > > to this I was also able to play around a bit and schedule the pass at > > different > > times. Currently it is scheduled right before loop distribution where loop > > header copying already took place which leads to the following effect. > > In fact I'd schedule it after loop distribution so there's the chance that > loop > distribution can expose a loop that fits the new pattern. > > > Running > > this setup over > > > > char *t (char *p) > > { > > for (; *p; ++p); > > return p; > > } > > > > the new pass transforms > > > > char * t (char * p) > > { > > char _1; > > char _7; > > > > <bb 2> [local count: 118111600]: > > _7 = *p_3(D); > > if (_7 != 0) > > goto <bb 5>; [89.00%] > > else > > goto <bb 7>; [11.00%] > > > > <bb 5> [local count: 105119324]: > > > > <bb 3> [local count: 955630225]: > > # p_8 = PHI <p_6(6), p_3(D)(5)> > > p_6 = p_8 + 1; > > _1 = *p_6; > > if (_1 != 0) > > goto <bb 6>; [89.00%] > > else > > goto <bb 8>; [11.00%] > > > > <bb 8> [local count: 105119324]: > > # p_2 = PHI <p_6(3)> > > goto <bb 4>; [100.00%] > > > > <bb 6> [local count: 850510901]: > > goto <bb 3>; [100.00%] > > > > <bb 7> [local count: 12992276]: > > > > <bb 4> [local count: 118111600]: > > # p_9 = PHI <p_2(8), p_3(D)(7)> > > return p_9; > > > > } > > > > into > > > > char * t (char * p) > > { > > char * _5; > > char _7; > > > > <bb 2> [local count: 118111600]: > > _7 = *p_3(D); > > if (_7 != 0) > > goto <bb 5>; [89.00%] > > else > > goto <bb 4>; [11.00%] > > > > <bb 5> [local count: 105119324]: > > _5 = p_3(D) + 1; > > p_10 = .RAWMEMCHR (_5, 0); > > > > <bb 4> [local count: 118111600]: > > # p_9 = PHI <p_10(5), p_3(D)(2)> > > return p_9; > > > > } > > > > which is fine so far. However, I haven't made up my mind so far whether it > > is > > worthwhile to spend more time in order to also eliminate the "first > > unrolling" > > of the loop. > > Might be a phiopt transform ;) Might apply to quite some set of > builtins. I wonder how the strlen case looks like though. > > > I gave it a shot by scheduling the pass prior pass copy header > > and ended up with: > > > > char * t (char * p) > > { > > <bb 2> [local count: 118111600]: > > p_5 = .RAWMEMCHR (p_3(D), 0); > > return p_5; > > > > } > > > > which seems optimal to me. The downside of this is that I have to > > initialize > > scalar evolution analysis which might be undesired that early. > > > > All this brings me to the question where do you see this peace of code > > running? > > If in a separate pass when would you schedule it? If in an existing pass, > > which one would you choose? > > I think it still fits loop distribution. If you manage to detect it > with your pass > standalone then you should be able to detect it in loop distribution. Can you > explain what part is "easier" as standalone pass?
Btw, another "fitting" pass would be final value replacement (pass_scev_cprop) since what these patterns provide is a builtin call to compute the value of one of the loop PHIs on exit. Note this pass leaves removal of in-loop computations to followup DCE which means that in some cases it does unprofitable transforms. There's a bug somewhere where I worked on doing final value replacement on-demand when DCE figures out the loop is otherwise dead but I never finished this (loop distribution could also use such mechanism to get rid of unwanted PHIs). > > Another topic which came up is whether there exists a more elegant solution > > to > > my current implementation in order to deal with stores (I'm speaking of the > > `if > > (store_dr)` statement inside of function transform_loop_1). For example, > > > > extern char *p; > > char *t () > > { > > for (; *p; ++p); > > return p; > > } > > > > ends up as > > > > char * t () > > { > > char * _1; > > char * _2; > > char _3; > > char * p.1_8; > > char _9; > > char * p.1_10; > > char * p.1_11; > > > > <bb 2> [local count: 118111600]: > > p.1_8 = p; > > _9 = *p.1_8; > > if (_9 != 0) > > goto <bb 5>; [89.00%] > > else > > goto <bb 7>; [11.00%] > > > > <bb 5> [local count: 105119324]: > > > > <bb 3> [local count: 955630225]: > > # p.1_10 = PHI <_1(6), p.1_8(5)> > > _1 = p.1_10 + 1; > > p = _1; > > _3 = *_1; > > if (_3 != 0) > > goto <bb 6>; [89.00%] > > else > > goto <bb 8>; [11.00%] > > > > <bb 8> [local count: 105119324]: > > # _2 = PHI <_1(3)> > > goto <bb 4>; [100.00%] > > > > <bb 6> [local count: 850510901]: > > goto <bb 3>; [100.00%] > > > > <bb 7> [local count: 12992276]: > > > > <bb 4> [local count: 118111600]: > > # p.1_11 = PHI <_2(8), p.1_8(7)> > > return p.1_11; > > > > } > > > > where inside the loop a load and store occurs. For a rawmemchr like loop I > > have to show that we never load from a memory location to which we write. > > Currently I solve this by hard coding those facts which are not generic at > > all. > > I gave compute_data_dependences_for_loop a try which failed to determine the > > fact that stores only happen to p[0] and loads from p[i] where i>0. Maybe > > there are more generic solutions to express this in contrast to my current > > one? > > So the example loop is not valid to be trasformed to rawmemchr since it's > valid to call it with p = &p; - but sure, once you pass the first *p != 0 > check > things become fishy but GCC isn't able to turn that into a non-dependence. > > Why is the case of stores inside the loop important? In fact that you > think it is > makes a case for integrating this with loop distribution since loop > distribution > would be able to prove (if possible) that the stores can be separated into > a different loop. > > And sorry for the delay in answering ... > > Thanks, > Richard. > > > > > Thanks again for your input so far. Really appreciated. > > > > Cheers, > > Stefan