On 01/13/2016 01:10 AM, Jeff Law wrote:

I'm going to focus on adpcm for the moment, in particular adpcm_coder.
It appears the key blocks are:
Looking at adpcm_decoder we have the same idiom as in adpcm_coder:

  if (bufferstep_79 != 0)
    goto <bb 6>;
  else
    goto <bb 7>;
;;    succ:       6 [50.0%]  (TRUE_VALUE,EXECUTABLE)
;;                7 [50.0%]  (FALSE_VALUE,EXECUTABLE)

;;   basic block 6, loop depth 1, count 0, freq 4550, maybe hot
;;    prev block 5, next block 7, flags: (NEW, REACHABLE)
;;    pred:       5 [50.0%]  (TRUE_VALUE,EXECUTABLE)
  delta_31 = inputbuffer_65 & 15;
  goto <bb 8>;
;;    succ:       8 [100.0%]  (FALLTHRU,EXECUTABLE)

;;   basic block 7, loop depth 1, count 0, freq 4550, maybe hot
;;    prev block 6, next block 8, flags: (NEW, REACHABLE)
;;    pred:       5 [50.0%]  (FALSE_VALUE,EXECUTABLE)
  inp_32 = inp_70 + 1;
  _33 = *inp_70;
  inputbuffer_34 = (int) _33;
  _35 = inputbuffer_34 >> 4;
  delta_36 = _35 & 15;
;;    succ:       8 [100.0%]  (FALLTHRU,EXECUTABLE)

;;   basic block 8, loop depth 1, count 0, freq 9100, maybe hot
;;    prev block 7, next block 9, flags: (NEW, REACHABLE)
;;    pred:       6 [100.0%]  (FALLTHRU,EXECUTABLE)
;;                7 [100.0%]  (FALLTHRU,EXECUTABLE)
  # inp_2 = PHI <inp_70(6), inp_32(7)>
  # delta_5 = PHI <delta_31(6), delta_36(7)>
  # inputbuffer_16 = PHI <inputbuffer_65(6), inputbuffer_34(7)>
  _83 = bufferstep_79 ^ 1;
  _1 = _83 & 1;
  _39 = indexTable[delta_5];
  index_40 = _39 + index_81;
  _84 = MIN_EXPR <index_40, 88>;
  index_62 = MAX_EXPR <_84, 0>;
  sign_41 = delta_5 & 8;
  vpdiff_42 = step_71 >> 3;
  _43 = delta_5 & 4;
  if (_43 != 0)
    goto <bb 9>;
  else
    goto <bb 10>;



The difference is there's all kinds of code between BB8 and the latch. So it doesn't trigger path splitting, which in turn doesn't expose the CSE/DCE opportunity for adpcm_decoder. We're likely leaving some performance on the table for this code.

This brings up a deeper question. Are we approaching path splitting in a fundamentally backward way?

Right now path splitting is guided by finding the right shape in the CFG immediately preceding the latch block of a loop. Plus some heuristics I'm working on to predict when the duplication is more likely to lead to CSE/DCE opportunities. But really it's driven by finding the right CFG shape before the latch block.

Should path splitting really be driven by partitions of PHI arguments and other incoming values (such as the implicit sets from conditionals). I've been pondering this for a long time as it's a natural extension of what we do in the erroneous path isolation pass.

ie at any join point in the CFG, look for partitions of the arguments in any PHI nodes and split paths based on that.

So for example if we had a PHI like

x5 = phi (x4, 0, 0, 1, 1)

Look at how x5 is used. If propagation of any of those PHI argument values would result in simplifications at the use points of x5, then we should consider splitting off all the paths with the beneficial PHI argument.

So in the case above, assume that the value 0 results in simplifications. We'd create two paths. One for the paths where x5 would get the value zero, another for everything else.

This is *a lot* like the path isolation done erroneous paths in terms of the CFG & SSA manipulations. Essentially in both cases we want to isolate a path so that we can manipulate it based on known inputs without affecting paths with uninteresting inputs. All that really changes is how we drive the selection of paths to isolate and whether or not we insert the __builtin_trap or not.


Anyway, going back to adpcm_decode, we do end up splitting this path:

 # vpdiff_12 = PHI <vpdiff_11(12), vpdiff_50(13)>
  if (sign_41 != 0)
    goto <bb 15>;
  else
    goto <bb 16>;
;;    succ:       15
;;                16

;;   basic block 15, loop depth 1
;;    pred:       14
  valpred_51 = valpred_76 - vpdiff_12;
  goto <bb 17>;
;;    succ:       17

;;   basic block 16, loop depth 1
;;    pred:       14
  valpred_52 = vpdiff_12 + valpred_76;
;;    succ:       17

;;   basic block 17, loop depth 1
;;    pred:       15
;;                16
  # valpred_7 = PHI <valpred_51(15), valpred_52(16)>
  _85 = MAX_EXPR <valpred_7, -32768>;
  valpred_13 = MIN_EXPR <_85, 32767>;
  step_53 = stepsizeTable[index_62];
  outp_54 = outp_69 + 2;
  _55 = (short int) valpred_13;
  MEM[base: outp_54, offset: -2B] = _55;
  if (outp_54 != _74)
    goto <bb 20>;
  else
    goto <bb 18>;

This doesn't result in anything particularly interesting/good AFAICT. We propagate valpred_51/52 into the use in the MAX_EXPR in the duplicate paths, but that doesn't allow any further simplification.

Ajit, can you confirm which of adpcm_code or adpcm_decode where path splitting is showing a gain? I suspect it's the former but would like to make sure so that I can adjust the heuristics properly.


jeff

Reply via email to