Hi,

I've been experimenting with some small testcases with the aim of getting the vectorizer to work on more loops. I started by compiling at -O3 the following testcase:

#define N 32

int a[N];
int b[N];

int foo ()
{
  for (int i = 0; i < N ; i++)
    {
      int m = (a[i] & i) ? 5 : 4;
      b[i] = a[i] * m;
    }
}

After copyrename3, immediately prior to dom1, the loop body looks like:

  <bb 2>:

  <bb 3>:
  # i_11 = PHI <i_9(7), 0(2)>
  _5 = a[i_11];
  _6 = i_11 & _5;
  if (_6 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:

  <bb 5>:
  # m_2 = PHI <5(4), 4(3)>
  _7 = m_2 * _5;
  b[i_11] = _7;
  i_9 = i_11 + 1;
  if (i_9 != 32)
    goto <bb 7>;
  else
    goto <bb 6>;

  <bb 7>:
  goto <bb 3>;

  <bb 6>:
  return;

dom1 then peeled part of the first loop iteration, producing:

  <bb 2>:
  goto <bb 8>;

  <bb 3>:
  # i_11 = PHI <i_9(6)>
  _5 = a[i_11];
  _6 = i_11 & _5;
  if (_6 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:

  <bb 5>:
  # m_14 = PHI <5(4), 4(3)>

  <bb 6>:
  # m_2 = PHI <m_14(5), 4(8)>
  # _15 = PHI <_5(5), _10(8)>
  # i_16 = PHI <i_11(5), i_1(8)>
  _7 = m_2 * _15;
  b[i_16] = _7;
  i_9 = i_16 + 1;
  if (i_9 != 32)
    goto <bb 3>;
  else
    goto <bb 7>;

  <bb 7>:
  return;

  <bb 8>:
  # i_1 = PHI <0(2)>
  _10 = a[i_1];
  _3 = i_1 & _10;
  goto <bb 6>;


which following ivcanon, had been simplified down to:

  <bb 2>:
  _10 = a[0];
  goto <bb 6>;

  <bb 3>:
  _5 = a[i_9];
  _6 = _5 & i_9;
  if (_6 != 0)
    goto <bb 5>;
  else
    goto <bb 4>;

  <bb 4>:

  <bb 5>:
  # m_14 = PHI <5(3), 4(4)>

  <bb 6>:
  # m_2 = PHI <m_14(5), 4(2)>
  # _15 = PHI <_5(5), _10(2)>
  # i_16 = PHI <i_9(5), 0(2)>
  # ivtmp_13 = PHI <ivtmp_3(5), 32(2)>
  _7 = m_2 * _15;
  b[i_16] = _7;
  i_9 = i_16 + 1;
  ivtmp_3 = ivtmp_13 - 1;
  if (ivtmp_3 != 0)
    goto <bb 3>;
  else
    goto <bb 7>;

  <bb 7>:
  return;

tree-if-conversion wouldn't deal with such a loop, and so the control flow made the vectorizer bail out too; resulting in scalar code (on both x86_64 and AArch64). I am currently testing a patch to tree-if-conv.c that makes it work on such a CFG, producing:

  <bb 2>:
  _10 = a[0];
  goto <bb 4>;

  <bb 3>:

  <bb 4>:
  # m_2 = PHI <m_14(3), 4(2)>
  # _15 = PHI <_5(3), _10(2)>
  # i_16 = PHI <i_9(3), 0(2)>
  # ivtmp_13 = PHI <ivtmp_3(3), 32(2)>
  _7 = m_2 * _15;
  b[i_16] = _7;
  i_9 = i_16 + 1;
  ivtmp_3 = ivtmp_13 - 1;
  _5 = a[i_9];
  _6 = _5 & i_9;
  m_14 = _6 != 0 ? 5 : 4;
  if (ivtmp_3 != 0)
    goto <bb 3>;
  else
    goto <bb 5>;

  <bb 5>:
  return;

However, this still fails to vectorize: the PHI nodes at the start of bb 4 prevent this (in vect_analyze_scalar_cycles: they look a bit like reductions, but aren't).

In contrast, a slightly-different testcase:

#define N 32

int a[N];
int b[N];

int foo ()
{
  for (int i = 0; i < N ; i++)
    {
      int cond = (a[i] & i) ? -1 : 0; // extra variable here
      int m = (cond) ? 5 : 4;
      b[i] = a[i] * m;
    }
}

after copyrename3, just before dom1, is only slightly different:

  <bb 2>:

  <bb 3>:
  # i_15 = PHI <i_10(7), 0(2)>
  _6 = a[i_15];
  _7 = i_15 & _6;
  if (_7 != 0)
    goto <bb 4>;
  else
    goto <bb 6>;

  <bb 4>:
  # m_3 = PHI <4(6), 5(3)>
  _8 = m_3 * _6;
  b[i_15] = _8;
  i_10 = i_15 + 1;
  if (i_10 != 32)
    goto <bb 7>;
  else
    goto <bb 5>;

  <bb 7>:
  goto <bb 3>;

  <bb 5>:
  return;

  <bb 6>:
  goto <bb 4>;

with bb6 being out-of-line at the end of the function, rather than bb4 falling through from just above bb5. However, this prevents dom1 from doing the partial peeling, and dom1 only changes the "goto bb7" into a "goto bb3":

  <bb 2>:

  <bb 3>:
  # i_15 = PHI <i_10(4), 0(2)>
  _6 = a[i_15];
  _7 = i_15 & _6;
  if (_7 != 0)
    goto <bb 4>;
  else
    goto <bb 6>;

  <bb 4>:
  # m_3 = PHI <4(6), 5(3)>
  _8 = m_3 * _6;
  b[i_15] = _8;
  i_10 = i_15 + 1;
  if (i_10 != 32)
    goto <bb 3>;
  else
    goto <bb 5>;

  <bb 5>:
  return;

  <bb 6>:
  goto <bb 4>;

Existing tree_if_conversion deals with this just fine, feeding into the 
vectorizer:

  <bb 2>:

  <bb 3>:
  # i_15 = PHI <i_10(4), 0(2)>
  # ivtmp_12 = PHI <ivtmp_1(4), 32(2)>
  _6 = a[i_15];
  _7 = _6 & i_15;
  m_3 = _7 != 0 ? 5 : 4;
  _8 = m_3 * _6;
  b[i_15] = _8;
  i_10 = i_15 + 1;
  ivtmp_1 = ivtmp_12 - 1;
  if (ivtmp_1 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:
  goto <bb 3>;

Which is vectorized OK (the only PHI is the induction variable i_15, a true scalar cycle).

So, besides the if conversion, I see two issues here:

(1) dom1 should really, in the second case, perform the same partial peeling that it does in the first testcase, if that is what it thinks is desirable. (Of course, we might want to fix that only later, as ATM that'd take us backwards).

(2) somehow, gcc should vectorize loops that have been partially peeled in that way. The only way I can really see, is to partially peel the same portion of the first (vectorization factor - 1) loop iterations too, so we can do a PHI of a whole vector at a time, and so I'm wondering if anyone can give me any pointers here - am I barking up the right tree - and is it reasonable to persuade existing vectorizer loop-peeling code (e.g. for alignment) to do this for us too, or would anyone recommend a different avenue?

Alternatively, maybe we don't want dom1 doing that sort of thing (?), but I'm inclined to think that if it's doing such optimizations, it's for a good reason ;) I guess there'll be other times where we *cannot* do partially peeling of later iterations...

Thoughts?

Thanks, Alan

Reply via email to