dom1 prevents vectorization via partial loop peeling?

Alan Lawrence Mon, 27 Apr 2015 09:13:12 -0700

Hi,

I've been experimenting with some small testcases with the aim of getting thevectorizer to work on more loops. I started by compiling at -O3 the followingtestcase:


#define N 32

int a[N];
int b[N];

int foo ()
{
  for (int i = 0; i < N ; i++)
    {
      int m = (a[i] & i) ? 5 : 4;
      b[i] = a[i] * m;
    }
}

After copyrename3, immediately prior to dom1, the loop body looks like:

  <bb 2>:

  <bb 3>:
  # i_11 = PHI <i_9(7), 0(2)>
  _5 = a[i_11];
  _6 = i_11 & _5;
  if (_6 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:

  <bb 5>:
  # m_2 = PHI <5(4), 4(3)>
  _7 = m_2 * _5;
  b[i_11] = _7;
  i_9 = i_11 + 1;
  if (i_9 != 32)
    goto <bb 7>;
  else
    goto <bb 6>;

  <bb 7>:
  goto <bb 3>;

  <bb 6>:
  return;

dom1 then peeled part of the first loop iteration, producing:

  <bb 2>:
  goto <bb 8>;

  <bb 3>:
  # i_11 = PHI <i_9(6)>
  _5 = a[i_11];
  _6 = i_11 & _5;
  if (_6 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:

  <bb 5>:
  # m_14 = PHI <5(4), 4(3)>

  <bb 6>:
  # m_2 = PHI <m_14(5), 4(8)>
  # _15 = PHI <_5(5), _10(8)>
  # i_16 = PHI <i_11(5), i_1(8)>
  _7 = m_2 * _15;
  b[i_16] = _7;
  i_9 = i_16 + 1;
  if (i_9 != 32)
    goto <bb 3>;
  else
    goto <bb 7>;

  <bb 7>:
  return;

  <bb 8>:
  # i_1 = PHI <0(2)>
  _10 = a[i_1];
  _3 = i_1 & _10;
  goto <bb 6>;


which following ivcanon, had been simplified down to:

  <bb 2>:
  _10 = a[0];
  goto <bb 6>;

  <bb 3>:
  _5 = a[i_9];
  _6 = _5 & i_9;
  if (_6 != 0)
    goto <bb 5>;
  else
    goto <bb 4>;

  <bb 4>:

  <bb 5>:
  # m_14 = PHI <5(3), 4(4)>

  <bb 6>:
  # m_2 = PHI <m_14(5), 4(2)>
  # _15 = PHI <_5(5), _10(2)>
  # i_16 = PHI <i_9(5), 0(2)>
  # ivtmp_13 = PHI <ivtmp_3(5), 32(2)>
  _7 = m_2 * _15;
  b[i_16] = _7;
  i_9 = i_16 + 1;
  ivtmp_3 = ivtmp_13 - 1;
  if (ivtmp_3 != 0)
    goto <bb 3>;
  else
    goto <bb 7>;

  <bb 7>:
  return;

tree-if-conversion wouldn't deal with such a loop, and so the control flow madethe vectorizer bail out too; resulting in scalar code (on both x86_64 andAArch64). I am currently testing a patch to tree-if-conv.c that makes it work onsuch a CFG, producing:


  <bb 2>:
  _10 = a[0];
  goto <bb 4>;

  <bb 3>:

  <bb 4>:
  # m_2 = PHI <m_14(3), 4(2)>
  # _15 = PHI <_5(3), _10(2)>
  # i_16 = PHI <i_9(3), 0(2)>
  # ivtmp_13 = PHI <ivtmp_3(3), 32(2)>
  _7 = m_2 * _15;
  b[i_16] = _7;
  i_9 = i_16 + 1;
  ivtmp_3 = ivtmp_13 - 1;
  _5 = a[i_9];
  _6 = _5 & i_9;
  m_14 = _6 != 0 ? 5 : 4;
  if (ivtmp_3 != 0)
    goto <bb 3>;
  else
    goto <bb 5>;

  <bb 5>:
  return;

However, this still fails to vectorize: the PHI nodes at the start of bb 4prevent this (in vect_analyze_scalar_cycles: they look a bit like reductions,but aren't).


In contrast, a slightly-different testcase:

#define N 32

int a[N];
int b[N];

int foo ()
{
  for (int i = 0; i < N ; i++)
    {
      int cond = (a[i] & i) ? -1 : 0; // extra variable here
      int m = (cond) ? 5 : 4;
      b[i] = a[i] * m;
    }
}

after copyrename3, just before dom1, is only slightly different:

  <bb 2>:

  <bb 3>:
  # i_15 = PHI <i_10(7), 0(2)>
  _6 = a[i_15];
  _7 = i_15 & _6;
  if (_7 != 0)
    goto <bb 4>;
  else
    goto <bb 6>;

  <bb 4>:
  # m_3 = PHI <4(6), 5(3)>
  _8 = m_3 * _6;
  b[i_15] = _8;
  i_10 = i_15 + 1;
  if (i_10 != 32)
    goto <bb 7>;
  else
    goto <bb 5>;

  <bb 7>:
  goto <bb 3>;

  <bb 5>:
  return;

  <bb 6>:
  goto <bb 4>;

with bb6 being out-of-line at the end of the function, rather than bb4 fallingthrough from just above bb5. However, this prevents dom1 from doing the partialpeeling, and dom1 only changes the "goto bb7" into a "goto bb3":


  <bb 2>:

  <bb 3>:
  # i_15 = PHI <i_10(4), 0(2)>
  _6 = a[i_15];
  _7 = i_15 & _6;
  if (_7 != 0)
    goto <bb 4>;
  else
    goto <bb 6>;

  <bb 4>:
  # m_3 = PHI <4(6), 5(3)>
  _8 = m_3 * _6;
  b[i_15] = _8;
  i_10 = i_15 + 1;
  if (i_10 != 32)
    goto <bb 3>;
  else
    goto <bb 5>;

  <bb 5>:
  return;

  <bb 6>:
  goto <bb 4>;

Existing tree_if_conversion deals with this just fine, feeding into the 
vectorizer:

  <bb 2>:

  <bb 3>:
  # i_15 = PHI <i_10(4), 0(2)>
  # ivtmp_12 = PHI <ivtmp_1(4), 32(2)>
  _6 = a[i_15];
  _7 = _6 & i_15;
  m_3 = _7 != 0 ? 5 : 4;
  _8 = m_3 * _6;
  b[i_15] = _8;
  i_10 = i_15 + 1;
  ivtmp_1 = ivtmp_12 - 1;
  if (ivtmp_1 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:
  goto <bb 3>;

Which is vectorized OK (the only PHI is the induction variable i_15, a truescalar cycle).


So, besides the if conversion, I see two issues here:

(1) dom1 should really, in the second case, perform the same partial peelingthat it does in the first testcase, if that is what it thinks is desirable. (Ofcourse, we might want to fix that only later, as ATM that'd take us backwards).

(2) somehow, gcc should vectorize loops that have been partially peeled in thatway. The only way I can really see, is to partially peel the same portion of thefirst (vectorization factor - 1) loop iterations too, so we can do a PHI of awhole vector at a time, and so I'm wondering if anyone can give me any pointershere - am I barking up the right tree - and is it reasonable to persuadeexisting vectorizer loop-peeling code (e.g. for alignment) to do this for ustoo, or would anyone recommend a different avenue?

Alternatively, maybe we don't want dom1 doing that sort of thing (?), but I'minclined to think that if it's doing such optimizations, it's for a good reason;) I guess there'll be other times where we *cannot* do partially peeling oflater iterations...


Thoughts?

Thanks, Alan

dom1 prevents vectorization via partial loop peeling?

Reply via email to