On 01/13/2016 01:10 AM, Jeff Law wrote:
I'm going to focus on adpcm for the moment, in particular adpcm_coder.
It appears the key blocks are:
Looking at adpcm_decoder we have the same idiom as in adpcm_coder:
if (bufferstep_79 != 0)
goto <bb 6>;
else
goto <bb 7>;
;; succ: 6 [50.0%] (TRUE_VALUE,EXECUTABLE)
;; 7 [50.0%] (FALSE_VALUE,EXECUTABLE)
;; basic block 6, loop depth 1, count 0, freq 4550, maybe hot
;; prev block 5, next block 7, flags: (NEW, REACHABLE)
;; pred: 5 [50.0%] (TRUE_VALUE,EXECUTABLE)
delta_31 = inputbuffer_65 & 15;
goto <bb 8>;
;; succ: 8 [100.0%] (FALLTHRU,EXECUTABLE)
;; basic block 7, loop depth 1, count 0, freq 4550, maybe hot
;; prev block 6, next block 8, flags: (NEW, REACHABLE)
;; pred: 5 [50.0%] (FALSE_VALUE,EXECUTABLE)
inp_32 = inp_70 + 1;
_33 = *inp_70;
inputbuffer_34 = (int) _33;
_35 = inputbuffer_34 >> 4;
delta_36 = _35 & 15;
;; succ: 8 [100.0%] (FALLTHRU,EXECUTABLE)
;; basic block 8, loop depth 1, count 0, freq 9100, maybe hot
;; prev block 7, next block 9, flags: (NEW, REACHABLE)
;; pred: 6 [100.0%] (FALLTHRU,EXECUTABLE)
;; 7 [100.0%] (FALLTHRU,EXECUTABLE)
# inp_2 = PHI <inp_70(6), inp_32(7)>
# delta_5 = PHI <delta_31(6), delta_36(7)>
# inputbuffer_16 = PHI <inputbuffer_65(6), inputbuffer_34(7)>
_83 = bufferstep_79 ^ 1;
_1 = _83 & 1;
_39 = indexTable[delta_5];
index_40 = _39 + index_81;
_84 = MIN_EXPR <index_40, 88>;
index_62 = MAX_EXPR <_84, 0>;
sign_41 = delta_5 & 8;
vpdiff_42 = step_71 >> 3;
_43 = delta_5 & 4;
if (_43 != 0)
goto <bb 9>;
else
goto <bb 10>;
The difference is there's all kinds of code between BB8 and the latch.
So it doesn't trigger path splitting, which in turn doesn't expose the
CSE/DCE opportunity for adpcm_decoder. We're likely leaving some
performance on the table for this code.
This brings up a deeper question. Are we approaching path splitting in
a fundamentally backward way?
Right now path splitting is guided by finding the right shape in the CFG
immediately preceding the latch block of a loop. Plus some heuristics
I'm working on to predict when the duplication is more likely to lead to
CSE/DCE opportunities. But really it's driven by finding the right CFG
shape before the latch block.
Should path splitting really be driven by partitions of PHI arguments
and other incoming values (such as the implicit sets from conditionals).
I've been pondering this for a long time as it's a natural extension
of what we do in the erroneous path isolation pass.
ie at any join point in the CFG, look for partitions of the arguments in
any PHI nodes and split paths based on that.
So for example if we had a PHI like
x5 = phi (x4, 0, 0, 1, 1)
Look at how x5 is used. If propagation of any of those PHI argument
values would result in simplifications at the use points of x5, then we
should consider splitting off all the paths with the beneficial PHI
argument.
So in the case above, assume that the value 0 results in
simplifications. We'd create two paths. One for the paths where x5
would get the value zero, another for everything else.
This is *a lot* like the path isolation done erroneous paths in terms of
the CFG & SSA manipulations. Essentially in both cases we want to
isolate a path so that we can manipulate it based on known inputs
without affecting paths with uninteresting inputs. All that really
changes is how we drive the selection of paths to isolate and whether or
not we insert the __builtin_trap or not.
Anyway, going back to adpcm_decode, we do end up splitting this path:
# vpdiff_12 = PHI <vpdiff_11(12), vpdiff_50(13)>
if (sign_41 != 0)
goto <bb 15>;
else
goto <bb 16>;
;; succ: 15
;; 16
;; basic block 15, loop depth 1
;; pred: 14
valpred_51 = valpred_76 - vpdiff_12;
goto <bb 17>;
;; succ: 17
;; basic block 16, loop depth 1
;; pred: 14
valpred_52 = vpdiff_12 + valpred_76;
;; succ: 17
;; basic block 17, loop depth 1
;; pred: 15
;; 16
# valpred_7 = PHI <valpred_51(15), valpred_52(16)>
_85 = MAX_EXPR <valpred_7, -32768>;
valpred_13 = MIN_EXPR <_85, 32767>;
step_53 = stepsizeTable[index_62];
outp_54 = outp_69 + 2;
_55 = (short int) valpred_13;
MEM[base: outp_54, offset: -2B] = _55;
if (outp_54 != _74)
goto <bb 20>;
else
goto <bb 18>;
This doesn't result in anything particularly interesting/good AFAICT.
We propagate valpred_51/52 into the use in the MAX_EXPR in the duplicate
paths, but that doesn't allow any further simplification.
Ajit, can you confirm which of adpcm_code or adpcm_decode where path
splitting is showing a gain? I suspect it's the former but would like
to make sure so that I can adjust the heuristics properly.
jeff