Hi,
I've been experimenting with some small testcases with the aim of getting the
vectorizer to work on more loops. I started by compiling at -O3 the following
testcase:
#define N 32
int a[N];
int b[N];
int foo ()
{
for (int i = 0; i < N ; i++)
{
int m = (a[i] & i) ? 5 : 4;
b[i] = a[i] * m;
}
}
After copyrename3, immediately prior to dom1, the loop body looks like:
<bb 2>:
<bb 3>:
# i_11 = PHI <i_9(7), 0(2)>
_5 = a[i_11];
_6 = i_11 & _5;
if (_6 != 0)
goto <bb 4>;
else
goto <bb 5>;
<bb 4>:
<bb 5>:
# m_2 = PHI <5(4), 4(3)>
_7 = m_2 * _5;
b[i_11] = _7;
i_9 = i_11 + 1;
if (i_9 != 32)
goto <bb 7>;
else
goto <bb 6>;
<bb 7>:
goto <bb 3>;
<bb 6>:
return;
dom1 then peeled part of the first loop iteration, producing:
<bb 2>:
goto <bb 8>;
<bb 3>:
# i_11 = PHI <i_9(6)>
_5 = a[i_11];
_6 = i_11 & _5;
if (_6 != 0)
goto <bb 4>;
else
goto <bb 5>;
<bb 4>:
<bb 5>:
# m_14 = PHI <5(4), 4(3)>
<bb 6>:
# m_2 = PHI <m_14(5), 4(8)>
# _15 = PHI <_5(5), _10(8)>
# i_16 = PHI <i_11(5), i_1(8)>
_7 = m_2 * _15;
b[i_16] = _7;
i_9 = i_16 + 1;
if (i_9 != 32)
goto <bb 3>;
else
goto <bb 7>;
<bb 7>:
return;
<bb 8>:
# i_1 = PHI <0(2)>
_10 = a[i_1];
_3 = i_1 & _10;
goto <bb 6>;
which following ivcanon, had been simplified down to:
<bb 2>:
_10 = a[0];
goto <bb 6>;
<bb 3>:
_5 = a[i_9];
_6 = _5 & i_9;
if (_6 != 0)
goto <bb 5>;
else
goto <bb 4>;
<bb 4>:
<bb 5>:
# m_14 = PHI <5(3), 4(4)>
<bb 6>:
# m_2 = PHI <m_14(5), 4(2)>
# _15 = PHI <_5(5), _10(2)>
# i_16 = PHI <i_9(5), 0(2)>
# ivtmp_13 = PHI <ivtmp_3(5), 32(2)>
_7 = m_2 * _15;
b[i_16] = _7;
i_9 = i_16 + 1;
ivtmp_3 = ivtmp_13 - 1;
if (ivtmp_3 != 0)
goto <bb 3>;
else
goto <bb 7>;
<bb 7>:
return;
tree-if-conversion wouldn't deal with such a loop, and so the control flow made
the vectorizer bail out too; resulting in scalar code (on both x86_64 and
AArch64). I am currently testing a patch to tree-if-conv.c that makes it work on
such a CFG, producing:
<bb 2>:
_10 = a[0];
goto <bb 4>;
<bb 3>:
<bb 4>:
# m_2 = PHI <m_14(3), 4(2)>
# _15 = PHI <_5(3), _10(2)>
# i_16 = PHI <i_9(3), 0(2)>
# ivtmp_13 = PHI <ivtmp_3(3), 32(2)>
_7 = m_2 * _15;
b[i_16] = _7;
i_9 = i_16 + 1;
ivtmp_3 = ivtmp_13 - 1;
_5 = a[i_9];
_6 = _5 & i_9;
m_14 = _6 != 0 ? 5 : 4;
if (ivtmp_3 != 0)
goto <bb 3>;
else
goto <bb 5>;
<bb 5>:
return;
However, this still fails to vectorize: the PHI nodes at the start of bb 4
prevent this (in vect_analyze_scalar_cycles: they look a bit like reductions,
but aren't).
In contrast, a slightly-different testcase:
#define N 32
int a[N];
int b[N];
int foo ()
{
for (int i = 0; i < N ; i++)
{
int cond = (a[i] & i) ? -1 : 0; // extra variable here
int m = (cond) ? 5 : 4;
b[i] = a[i] * m;
}
}
after copyrename3, just before dom1, is only slightly different:
<bb 2>:
<bb 3>:
# i_15 = PHI <i_10(7), 0(2)>
_6 = a[i_15];
_7 = i_15 & _6;
if (_7 != 0)
goto <bb 4>;
else
goto <bb 6>;
<bb 4>:
# m_3 = PHI <4(6), 5(3)>
_8 = m_3 * _6;
b[i_15] = _8;
i_10 = i_15 + 1;
if (i_10 != 32)
goto <bb 7>;
else
goto <bb 5>;
<bb 7>:
goto <bb 3>;
<bb 5>:
return;
<bb 6>:
goto <bb 4>;
with bb6 being out-of-line at the end of the function, rather than bb4 falling
through from just above bb5. However, this prevents dom1 from doing the partial
peeling, and dom1 only changes the "goto bb7" into a "goto bb3":
<bb 2>:
<bb 3>:
# i_15 = PHI <i_10(4), 0(2)>
_6 = a[i_15];
_7 = i_15 & _6;
if (_7 != 0)
goto <bb 4>;
else
goto <bb 6>;
<bb 4>:
# m_3 = PHI <4(6), 5(3)>
_8 = m_3 * _6;
b[i_15] = _8;
i_10 = i_15 + 1;
if (i_10 != 32)
goto <bb 3>;
else
goto <bb 5>;
<bb 5>:
return;
<bb 6>:
goto <bb 4>;
Existing tree_if_conversion deals with this just fine, feeding into the
vectorizer:
<bb 2>:
<bb 3>:
# i_15 = PHI <i_10(4), 0(2)>
# ivtmp_12 = PHI <ivtmp_1(4), 32(2)>
_6 = a[i_15];
_7 = _6 & i_15;
m_3 = _7 != 0 ? 5 : 4;
_8 = m_3 * _6;
b[i_15] = _8;
i_10 = i_15 + 1;
ivtmp_1 = ivtmp_12 - 1;
if (ivtmp_1 != 0)
goto <bb 4>;
else
goto <bb 5>;
<bb 4>:
goto <bb 3>;
Which is vectorized OK (the only PHI is the induction variable i_15, a true
scalar cycle).
So, besides the if conversion, I see two issues here:
(1) dom1 should really, in the second case, perform the same partial peeling
that it does in the first testcase, if that is what it thinks is desirable. (Of
course, we might want to fix that only later, as ATM that'd take us backwards).
(2) somehow, gcc should vectorize loops that have been partially peeled in that
way. The only way I can really see, is to partially peel the same portion of the
first (vectorization factor - 1) loop iterations too, so we can do a PHI of a
whole vector at a time, and so I'm wondering if anyone can give me any pointers
here - am I barking up the right tree - and is it reasonable to persuade
existing vectorizer loop-peeling code (e.g. for alignment) to do this for us
too, or would anyone recommend a different avenue?
Alternatively, maybe we don't want dom1 doing that sort of thing (?), but I'm
inclined to think that if it's doing such optimizations, it's for a good reason
;) I guess there'll be other times where we *cannot* do partially peeling of
later iterations...
Thoughts?
Thanks, Alan