On Tue, Jul 7, 2015 at 10:55 PM, Abe <abe_skol...@yahoo.com> wrote:

> [Alan wrote:]
>
>> My understanding is that any decision as to whether one or both of 'y' or
>> 'z' is evaluated (when 'evaluation' involves doing any work, e.g. a load)
>> has already been encoded into the gimple/tree IR. Thus, if we are to
>> evaluate only one of 'y' or 'z' in your example, the IR will (prior to
>> if-conversion) contain basic blocks and control flow that make us jump
>> around the one that's not evaluated.
>
>> This appears to be the case in pr61194.c: prior to if-conversion, the IR
>> for the loop in barX is
>
>> <bb 3>:
>>   # i_16 = PHI <i_13(7), 0(2)>
>>   # ivtmp_21 = PHI <ivtmp_20(7), 1024(2)>
>>   _5 = x[i_16];
>>   _6 = _5 > 0.0;
>>   _7 = w[i_16];
>>   _8 = _7 < 0.0;
>>   _9 = _6 & _8;
>>   if (_9 != 0)
>>     goto <bb 4>;
>>   else
>>     goto <bb 5>;
>>
>> <bb 4>:
>>   iftmp.0_10 = z[i_16];
>>   goto <bb 6>;
>>
>> <bb 5>:
>>   iftmp.0_11 = y[i_16];
>
> [snip]
>
> [note: the following section, until the next quoted line, is mostly an
> explanation of if-conversion and its trade-offs; please feel free to skip
> it or to read it later]
>
> OK, but that`s overly conservative in this case for most/all modern
> high-speed processors: "z[i]" and "y[i]" are both pure expressions, and
> the possible values of 'i' here [given that the hardware doesn`t
> malfunction, nobody alters the value in a debugger, etc.] can be proven
> not to overflow the bounds of the arrays. With purity established and
> bounds-validity known to be OK, all we need to do is to weigh the "cost"
> of evaluating both expressions against that of inflicting a conditional
> branch on the hardware.
> I believe this "cost" depends on the values in the arrays 'x' and 'w':
> when the probability of either case -- "((x[i]>0) & (w[i]<0))" being true
> or being false -- stays very close to 50% for a long time, the branch
> predictor is going to find it difficult to make any sense of it, and
> speculative execution is going to execute the incorrect branch about 50%
> of the time, until the data come in that render speculation no longer
> needed in the given iteration.
>
> On most/all modern high-speed processors, doing the loads unconditionally
> is going to be better IMO for anything even remotely resembling a
> "normal" or "average" workload. In other words, so long as the result of
> "((x[i]>0) & (w[i]<0))" changes frequently and unpredictably as 'i' is
> incremented, it should be better to do both loads: the needed data are in
> cache lines that are already in cache, possibly even in the L1 data
> cache, so it`s pretty "cheap" to just load them. Conditional branches
> based on difficult-to-predict data that change frequently [as the arrays
> are traversed], OTOH, are likely to cause stalls and result in execution
> slower than the fastest possible on that CPU for the source code in
> question.
>
> In cases where the data upon which the condition depends lead to
> mostly-true or mostly-false results, the conditional branch should
> perform well -- for loops with enough iterations -- on any CPU with a
> decent branch predictor, but those are rare conditions IMO. I think we
> should assume random data in the arrays unless/until we have analysis
> that tells us otherwise, and I don`t expect much compile-time analysis of
> the contents of arrays to become the norm in compilers anytime soon.
> These are probably extremely infrequent corner cases anyway, with the
> possible exception of the processing of sparse matrices; does anybody
> reading this with a strong background in numerical/scientific computing
> have anything to comment on this?
>
> Predication of instructions can help to remove the burden of the
> conditional branch, but is not available on all relevant architectures.
> Some architectures that are typically implemented in modern high-speed
> processors -- i.e. with high core frequency, caches, speculative
> execution, etc. -- do not fully support predication [as, for example,
> 32-bit ARM does] but _do_ support "conditional move" [hereinafter
> "cmove"]. If/when the ISA in question supports cmove to/from main
> memory, perhaps it would be profitable to use two cmoves back-to-back
> with opposite conditions and the same register [destination for a load,
> source for a store] to implement e.g. "temp = c ? X[x] : Y[y]" and/or
> "temp = C[c] ? X[x] : Y[y]" etc. Even without cmove to/from main memory,
> two cmoves back-to-back with opposite conditions could still be used,
> e.g. [not for a real-world ISA]:
>
>   load X[x] -> reg1
>   load Y[y] -> reg2
>   cmove c ? reg1 -> reg3
>   cmove (!c) ? reg2 -> reg3
>
> Or even better, if the ISA can support something like:
>
>   load X[x] -> reg1
>   load Y[y] -> reg2
>   cmove (c ? reg1 : reg2) -> reg3
>
> However, this is a code-gen issue from my POV, and at the present time
> all the if-conversion work is occurring at the GIMPLE level. If anybody
> reading this knows how I could make the if-converter generate GIMPLE that
> leads to code-gen that is better for at least one ISA without negatively
> impacting code-gen for any other ISA, I`d be glad to read about it.
>
>> Without -ftree-loop-if-convert-stores, if-conversion leaves this alone,
>> and vectorization fails (i.e. the vectorizer bails out because the loop
>> has >2 basic blocks). With -ftree-loop-if-convert-stores, if-conversion
>> produces
> [snip]
>>
>> where I have commented the conditional loads that have become
>> unconditional.
>
>> (Hence, "-ftree-loop-if-convert-stores" looks misnamed -- it affects how
>> the if-conversion phase converts loads, too -- please correct me if I
>> misunderstand (Richard?) ?!) This contains no control flow, and so is
>> vectorizable.
> [snip]
>> (This is all without your scratchpad patch, of course.)
>
> I think the preceding is quite correct.
>
> Sebastian and I have discussed [at some length, on my part ;-)] what to
> do about the flags. We definitely want them to change. We might wind up
> with a different number of flags, i.e. not just rename one or both of the
> two we currently have.
>
> Does anybody know of any large codebase that depends heavily on the
> current flags? AFAIK these have been used infrequently, if ever, in
> production code so far.
The GIMPLE-level if-conversion code was written purely to make loops
suitable for vectorization. It wasn't meant to provide if-conversion of
scalar code in the end (even though it does). We've discussed enabling the
versioning path unconditionally, for example. So if the new scheme with
scratchpads produces more "correct" code, but code that is known to fail
vectorization, then it's done in the wrong place -- because the whole
purpose of GIMPLE if-conversion is to enable more vectorization.

Richard.

>
> Regards,
>
> Abe