On Tue, Jul 7, 2015 at 10:55 PM, Abe <abe_skol...@yahoo.com> wrote:

> [Alan wrote:]
>
>> My understanding is that any decision as to whether one or both of 'y' or
>> 'z' is evaluated (when 'evaluation' involves doing any work, e.g. a load)
>> has already been encoded into the gimple/tree IR. Thus, if we are to
>> evaluate only one of 'y' or 'z' in your example, the IR will (prior to
>> if-conversion) contain basic blocks and control flow that make us jump
>> around the one that's not evaluated.
>
>> This appears to be the case in pr61194.c: prior to if-conversion, the IR
>> for the loop in barX is
>
>> <bb 3>:
>>   # i_16 = PHI <i_13(7), 0(2)>
>>   # ivtmp_21 = PHI <ivtmp_20(7), 1024(2)>
>>   _5 = x[i_16];
>>   _6 = _5 > 0.0;
>>   _7 = w[i_16];
>>   _8 = _7 < 0.0;
>>   _9 = _6 & _8;
>>   if (_9 != 0)
>>     goto <bb 4>;
>>   else
>>     goto <bb 5>;
>>
>> <bb 4>:
>>   iftmp.0_10 = z[i_16];
>>   goto <bb 6>;
>>
>> <bb 5>:
>>   iftmp.0_11 = y[i_16];
>
> [snip]
>
> [note: the following section, until the next quoted line, is mostly an
> explanation of if-conversion and its trade-offs; please feel free to skip
> it or to read it later]
>
> OK, but that`s overly conservative in this case for most/all modern
> high-speed processors: "z[i]" and "y[i]" are both pure expressions, and
> the possible values of 'i' here [given that the hardware doesn`t
> malfunction, nobody alters the value in a debugger, etc.] can be proven
> not to overflow the bounds of the arrays. With purity established and
> bounds-validity known to be OK, all we need to do is to weigh the "cost"
> of evaluating both expressions against that of inflicting a conditional
> branch on the hardware.
> I believe this "cost" depends on the values in the arrays 'x' and 'w':
> when the probability of either case -- "((x[i]>0) & (w[i]<0))" being true
> or being false -- stays very close to 50% for a long time, the branch
> predictor is going to find it difficult to make any sense of it, and
> speculative execution is going to execute the incorrect branch about 50%
> of the time, until the data come in that render speculation no longer
> needed in the given iteration.
>
> On most/all modern high-speed processors, doing the loads unconditionally
> is going to be better IMO for anything even remotely resembling a
> "normal" or "average" workload. In other words, so long as the result of
> "((x[i]>0) & (w[i]<0))" changes frequently and unpredictably as 'i' is
> incremented, it should be better to do both loads: the needed data are in
> cache lines that are already in cache, possibly even in the L1 data
> cache, so it`s pretty "cheap" to just load them. Conditional branches
> based on difficult-to-predict data that change frequently [as the arrays
> are traversed], OTOH, are likely to cause stalls and result in execution
> slower than the fastest possible on that CPU for the source code in
> question.
>
> In cases where the data upon which the condition depends lead to
> mostly-true or mostly-false results, the conditional branch should
> perform well -- for loops with enough iterations -- on any CPU with a
> decent branch predictor, but those are rare conditions IMO. I think we
> should assume random data in the arrays unless/until we have analysis
> that tells us otherwise, and I don`t expect much compile-time analysis of
> the contents of arrays to become the norm in compilers anytime soon.
> These are probably extremely infrequent corner cases anyway, with the
> possible exception of the processing of sparse matrices; does anybody
> reading this with a strong background in numerical/scientific computing
> have anything to comment on this?
>
> Predication of instructions can help to remove the burden of the
> conditional branch, but is not available on all relevant architectures.
> Some architectures that are typically implemented in modern high-speed
> processors -- i.e. with high core frequency, caches, speculative
> execution, etc. -- do not fully support predication [as, for example,
> 32-bit ARM does] but _do_ support "conditional move" [hereinafter
> "cmove"]. If/when the ISA in question supports cmove to/from main
> memory, perhaps it would be profitable to use two cmoves back-to-back
> with opposite conditions and the same register [destination for a load,
> source for a store] to implement e.g. "temp = c ? X[x] : Y[y]" and/or
> "temp = C[c] ? X[x] : Y[y]" etc. Even without cmove to/from main memory,
> two cmoves back-to-back with opposite conditions could still be used,
> e.g. [not for a real-world ISA]:
>
>   load X[x] -> reg1
>   load Y[y] -> reg2
>   cmove c ? reg1 -> reg3
>   cmove (!c) ? reg2 -> reg3
>
> Or even better, if the ISA can support something like:
>
>   load X[x] -> reg1
>   load Y[y] -> reg2
>   cmove (c ? reg1 : reg2) -> reg3
>
> However, this is a code-gen issue from my POV, and at the present time
> all the if-conversion work is occurring at the GIMPLE level. If anybody
> reading this knows how I could make the if-converter generate GIMPLE that
> leads to code-gen that is better for at least one ISA without negatively
> impacting code-gen for any other ISA, I`d be glad to read about it.
>
>> Without -ftree-loop-if-convert-stores, if-conversion leaves this alone,
>> and vectorization fails (i.e. the vectorizer bails out because the loop
>> has >2 basic blocks). With -ftree-loop-if-convert-stores, if-conversion
>> produces
> [snip]
>>
>> where I have commented the conditional loads that have become
>> unconditional.
>
>> (Hence, "-ftree-loop-if-convert-stores" looks misnamed -- it affects how
>> the if-conversion phase converts loads, too -- please correct me if I
>> misunderstand (Richard?) ?!) This contains no control flow, and so is
>> vectorizable.
> [snip]
>> (This is all without your scratchpad patch, of course.)
>
> I think the preceding is quite correct.
>
> Sebastian and I have discussed [at some length, on my part ;-)] what to
> do about the flags. We definitely want them to change. We might wind up
> with a different number of flags, i.e. not just rename one or both of the
> two we currently have.
>
> Does anybody know of any large codebase that depends heavily on the
> current flags? AFAIK these have been used infrequently, if ever, in
> production code so far.
The GIMPLE-level if-conversion code was written purely to make loops
suitable for vectorization. It wasn't meant to provide if-conversion of
scalar code in the end (even though it does). We've discussed enabling the
versioning path unconditionally, for example. So if the new scheme with
scratchpads produces more "correct" code, but code that is known to fail
vectorization, then it's done in the wrong place -- because the whole
purpose of GIMPLE if-conversion is to enable more vectorization.

Richard.

>
> Regards,
>
> Abe