On Fri, May 7, 2021 at 2:32 PM Stefan Schulze Frielinghaus
<stefa...@linux.ibm.com> wrote:
>
> On Wed, May 05, 2021 at 11:36:41AM +0200, Richard Biener wrote:
> > On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus
> > <stefa...@linux.ibm.com> wrote:
> > >
> > > [snip]
> > >
> > > Please find attached a new version of the patch. A major change
> > > compared to the previous patch is that I created a separate pass,
> > > which hopefully also makes reviewing easier since it is almost
> > > self-contained. After realizing that detecting loops which mimic the
> > > behavior of rawmemchr/strlen functions does not really fit into the
> > > topic of loop distribution, I created a separate pass.
> >
> > It's true that these reduction-like patterns are more difficult than
> > the existing memcpy/memset cases.
> >
> > > Due to this I was also able to play around a bit and schedule the
> > > pass at different times. Currently it is scheduled right before loop
> > > distribution, where loop header copying has already taken place,
> > > which leads to the following effect.
> >
> > In fact I'd schedule it after loop distribution so there's the chance
> > that loop distribution can expose a loop that fits the new pattern.
> >
> > > Running this setup over
> > >
> > > char *t (char *p)
> > > {
> > >   for (; *p; ++p);
> > >   return p;
> > > }
> > >
> > > the new pass transforms
> > >
> > > char * t (char * p)
> > > {
> > >   char _1;
> > >   char _7;
> > >
> > >   <bb 2> [local count: 118111600]:
> > >   _7 = *p_3(D);
> > >   if (_7 != 0)
> > >     goto <bb 5>; [89.00%]
> > >   else
> > >     goto <bb 7>; [11.00%]
> > >
> > >   <bb 5> [local count: 105119324]:
> > >
> > >   <bb 3> [local count: 955630225]:
> > >   # p_8 = PHI <p_6(6), p_3(D)(5)>
> > >   p_6 = p_8 + 1;
> > >   _1 = *p_6;
> > >   if (_1 != 0)
> > >     goto <bb 6>; [89.00%]
> > >   else
> > >     goto <bb 8>; [11.00%]
> > >
> > >   <bb 8> [local count: 105119324]:
> > >   # p_2 = PHI <p_6(3)>
> > >   goto <bb 4>; [100.00%]
> > >
> > >   <bb 6> [local count: 850510901]:
> > >   goto <bb 3>; [100.00%]
> > >
> > >   <bb 7> [local count: 12992276]:
> > >
> > >   <bb 4> [local count: 118111600]:
> > >   # p_9 = PHI <p_2(8), p_3(D)(7)>
> > >   return p_9;
> > >
> > > }
> > >
> > > into
> > >
> > > char * t (char * p)
> > > {
> > >   char * _5;
> > >   char _7;
> > >
> > >   <bb 2> [local count: 118111600]:
> > >   _7 = *p_3(D);
> > >   if (_7 != 0)
> > >     goto <bb 5>; [89.00%]
> > >   else
> > >     goto <bb 4>; [11.00%]
> > >
> > >   <bb 5> [local count: 105119324]:
> > >   _5 = p_3(D) + 1;
> > >   p_10 = .RAWMEMCHR (_5, 0);
> > >
> > >   <bb 4> [local count: 118111600]:
> > >   # p_9 = PHI <p_10(5), p_3(D)(2)>
> > >   return p_9;
> > >
> > > }
> > >
> > > which is fine so far. However, I haven't made up my mind yet whether
> > > it is worthwhile to spend more time in order to also eliminate the
> > > "first unrolling" of the loop.
> >
> > Might be a phiopt transform ;)  Might apply to quite some set of
> > builtins. I wonder what the strlen case looks like though.
> >
> > > I gave it a shot by scheduling the pass prior to the
> > > loop-header-copying pass and ended up with:
> > >
> > > char * t (char * p)
> > > {
> > >   <bb 2> [local count: 118111600]:
> > >   p_5 = .RAWMEMCHR (p_3(D), 0);
> > >   return p_5;
> > >
> > > }
> > >
> > > which seems optimal to me. The downside of this is that I have to
> > > initialize scalar evolution analysis, which might be undesired that
> > > early.
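For context, a minimal source-level sketch of the two loop shapes being
discussed (generic examples, not taken from the patch): the rawmemchr-like
form computes the address of the first zero byte, which is what glibc's
rawmemchr (a GNU extension) returns, while the strlen-like form counts the
bytes before it.

  /* rawmemchr-like: the result is a pointer to the first zero byte,
     i.e. what rawmemchr (p, 0) would return.  */
  char *scan (char *p)
  {
    for (; *p; ++p)
      ;
    return p;
  }

  /* strlen-like: the result is the number of bytes before the first zero
     byte, i.e. what strlen (p) would return.  */
  unsigned long len (const char *p)
  {
    unsigned long n = 0;
    for (; p[n]; ++n)
      ;
    return n;
  }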
> > > All this brings me to the question where do you see this piece of
> > > code running?  If in a separate pass when would you schedule it?  If
> > > in an existing pass, which one would you choose?
> >
> > I think it still fits loop distribution. If you manage to detect it
> > with your pass standalone then you should be able to detect it in loop
> > distribution.
>
> If a loop is distributed only because one of the partitions matches a
> rawmemchr/strlen-like loop pattern, then we have at least two partitions
> which walk over the same memory region. Since a rawmemchr/strlen-like
> loop has no body (neglecting expression-3 of a for-loop where just an
> increment happens) it is governed by the memory accesses in the loop
> condition. Therefore, in such a case loop distribution would result in
> performance degradation. This is why I think that it does not fit
> conceptually into ldist pass. However, since I make use of a couple of
> helper functions from ldist pass, it may still fit technically.
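As a hypothetical illustration of the two-partitions concern (this example is
not from the patch or the thread): if distribution split the following loop
into a strlen-like scan of s plus a separate copy loop, both partitions would
traverse s, so the split by itself buys nothing; any benefit would have to
come from turning the scan partition into a library call.

  void copy (char *restrict d, const char *restrict s)
  {
    /* Scanning for the terminating zero and copying happen in one pass;
       distributing them yields two loops that both walk over s.  */
    for (unsigned long i = 0; s[i]; ++i)
      d[i] = s[i];
  }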
> Since currently all ldist optimizations operate over loops where niters
> is known, and for rawmemchr/strlen-like loops this is not the case, it
> is not possible that those optimizations expose a loop which is suitable
> for rawmemchr/strlen optimization.

True - though that seems to be an unnecessary restriction.

> Therefore, what do you think about scheduling rawmemchr/strlen
> optimization right between those if-statements of function
> loop_distribution::execute?
>
>     if (nb_generated_loops + nb_generated_calls > 0)
>       {
>         changed = true;
>         if (dump_enabled_p ())
>           dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>                            loc, "Loop%s %d distributed: split to %d loops "
>                            "and %d library calls.\n", str, loop->num,
>                            nb_generated_loops, nb_generated_calls);
>
>         break;
>       }
>
>     // rawmemchr/strlen like loops
>
>     if (dump_file && (dump_flags & TDF_DETAILS))
>       fprintf (dump_file, "Loop%s %d not distributed.\n", str, loop->num);

but we won't ever arrive here because of the niters condition. But yes,
doing the pattern matching in the innermost loop processing code looks
good to me - for the specific case it would be

      /* Don't distribute loop if niters is unknown.  */
      tree niters = number_of_latch_executions (loop);
      if (niters == NULL_TREE || niters == chrec_dont_know)
 ---> here?
        continue;

> > Can you explain what part is "easier" as standalone pass?
>
> Yea that term is rather misleading. It was probably easier for me to
> understand the underlying problem and to play around with my code.
> There are no technical reasons for a standalone pass.

And sorry for the late response...

Richard.

> Cheers,
> Stefan
> > >
> > > Another topic which came up is whether there exists a more elegant
> > > solution than my current implementation in order to deal with stores
> > > (I'm speaking of the `if (store_dr)` statement inside function
> > > transform_loop_1). For example,
> > >
> > > extern char *p;
> > > char *t ()
> > > {
> > >   for (; *p; ++p);
> > >   return p;
> > > }
> > >
> > > ends up as
> > >
> > > char * t ()
> > > {
> > >   char * _1;
> > >   char * _2;
> > >   char _3;
> > >   char * p.1_8;
> > >   char _9;
> > >   char * p.1_10;
> > >   char * p.1_11;
> > >
> > >   <bb 2> [local count: 118111600]:
> > >   p.1_8 = p;
> > >   _9 = *p.1_8;
> > >   if (_9 != 0)
> > >     goto <bb 5>; [89.00%]
> > >   else
> > >     goto <bb 7>; [11.00%]
> > >
> > >   <bb 5> [local count: 105119324]:
> > >
> > >   <bb 3> [local count: 955630225]:
> > >   # p.1_10 = PHI <_1(6), p.1_8(5)>
> > >   _1 = p.1_10 + 1;
> > >   p = _1;
> > >   _3 = *_1;
> > >   if (_3 != 0)
> > >     goto <bb 6>; [89.00%]
> > >   else
> > >     goto <bb 8>; [11.00%]
> > >
> > >   <bb 8> [local count: 105119324]:
> > >   # _2 = PHI <_1(3)>
> > >   goto <bb 4>; [100.00%]
> > >
> > >   <bb 6> [local count: 850510901]:
> > >   goto <bb 3>; [100.00%]
> > >
> > >   <bb 7> [local count: 12992276]:
> > >
> > >   <bb 4> [local count: 118111600]:
> > >   # p.1_11 = PHI <_2(8), p.1_8(7)>
> > >   return p.1_11;
> > >
> > > }
> > >
> > > where inside the loop a load and a store occur. For a rawmemchr-like
> > > loop I have to show that we never load from a memory location to
> > > which we write. Currently I solve this by hard-coding those facts,
> > > which is not generic at all. I gave compute_data_dependences_for_loop
> > > a try, which failed to determine the fact that stores only happen to
> > > p[0] and loads only happen from p[i] where i > 0. Maybe there is a
> > > more generic way to express this than my current solution?
> >
> > So the example loop is not valid to be transformed to rawmemchr since
> > it's valid to call it with p = &p; - but sure, once you pass the first
> > *p != 0 check things become fishy but GCC isn't able to turn that into
> > a non-dependence.
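One way to make the p = &p point concrete (a contrived sketch, not taken
from the thread): nothing in the extern declaration prevents the scanned
bytes from overlapping the storage of p itself.

  extern char *p;

  void setup (void)
  {
    /* Valid, if contrived: p now points at its own storage, so the loop's
       store "p = p + 1" changes the very bytes that "*p" reads next.  A
       blanket rewrite of the loop to a rawmemchr-style scan would change
       behaviour unless this kind of aliasing can be ruled out.  */
    p = (char *) &p;
  }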
> > Why is the case of stores inside the loop important?  The fact that
> > you think it is makes a case for integrating this with loop
> > distribution since loop distribution would be able to prove (if
> > possible) that the stores can be separated into a different loop.
> >
> > And sorry for the delay in answering ...
> >
> > Thanks,
> > Richard.
> >
> > > Thanks again for your input so far. Really appreciated.
> > >
> > > Cheers,
> > > Stefan