http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49442
--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-06-16 16:33:16 UTC ---
I was testing on SandyBridge, but it was reported to us for Core2. The loop
used to be vectorized in 4.4 and still is now; in both cases the compiler
emits a huge, hard-to-decipher runtime test with many conditions and then
uses either the non-vectorized loop or the vectorized one. In r148210 the
condition also included:

  vect_p.44_29 = (vector double *) out1_6(D);
  addr2int0.45_28 = (long int) vect_p.44_29;
  vect_p.48_37 = (vector double *) out2_21(D);
  addr2int1.49_38 = (long int) vect_p.48_37;
  orptrs1.50_40 = addr2int0.45_28 | addr2int1.49_38;
  vect_p.53_41 = (vector double *) out3_34(D);
  addr2int2.54_42 = (long int) vect_p.53_41;
  orptrs2.55_51 = orptrs1.50_40 | addr2int2.54_42;
  andmask.56_52 = orptrs2.55_51 & 15;
  ...
  D.2833_72 = andmask.56_52 == 0;

(i.e. it ORed the three store pointers together and tested the low 4 bits, so
a single compare verifies 16-byte alignment of all of them), but the new
condition does not include it. Previously the vectorized loop used movapd
stores:

  movapd %xmm0, (%rdi,%r10)
  ...
  movapd %xmm0, (%rsi,%r10)
  ...
  movapd %xmm0, (%rdx,%r10)

while now it uses:

  movlpd %xmm0, (%rdi,%rbx)
  movhpd %xmm0, 8(%rdi,%rbx)
  ...
  movlpd %xmm0, (%rsi,%rbx)
  movhpd %xmm0, 8(%rsi,%rbx)
  ...
  movlpd %xmm0, (%rdx,%rbx)
  movhpd %xmm0, 8(%rdx,%rbx)

Surprisingly, the new code is slower even when the pointers aren't aligned:

r148210:
Strip out best and worst realtime result
minimum: 8.849950347 sec real / 0.000085810 sec CPU
maximum: 9.278652529 sec real / 0.000153471 sec CPU
average: 9.055898562 sec real / 0.000138755 sec CPU
stdev : 0.073603342 sec real / 0.000016469 sec CPU

r148211:
Strip out best and worst realtime result
minimum: 12.089365836 sec real / 0.000081233 sec CPU
maximum: 12.378188295 sec real / 0.000158253 sec CPU
average: 12.234883839 sec real / 0.000136920 sec CPU
stdev : 0.073461527 sec real / 0.000017463 sec CPU

(same baz routine, but with

  double a[60000] __attribute__((aligned (32)));

  int
  main ()
  {
    int i;
    for (i = 0; i < 500000; i++)
      baz (a + 1, a + 10001, a + 30000, a + 40000, a + 50000, 10000);
    return 0;
  }

used instead). Here the r148210-generated code uses the scalar loop, while the
r148211 code uses the vectorized one with those movlpd+movhpd stores. So in
this particular case, for this particular CPU, it would be better if the cost
model decided to verify at runtime whether all store pointers are sufficiently
aligned and to use the vectorized loop only in that case.

BTW, the vectorization condition is really long; is it a good idea to let it
all go through with just a single branch at the end? Wouldn't it be better to
test the checks most likely to fail first, do a conditional branch, then some
other tests, then another conditional branch?

I've talked with Richard on IRC about how users could promise the compiler
that the pointers are sufficiently aligned, so that it can just assume they
are aligned (instead of testing for it) and use that in the loop, both for
loads and stores. Possibilities include __attribute__((ptr_align (align
[, misalign]))) on const pointer parameters and const pointer variables, or
adding assertions using __builtin_unreachable (). But now that I think about
it more, we already version the loop for vectorization in this case, so
wouldn't it be better to just add some extension which would allow the user to
say something is likely? Such a hint could say, e.g., that some pointer is
likely aligned (or misaligned) in a particular way, or that pointers don't
overlap (yeah, I know, we have restrict, but e.g. on STL containers it is more
fun to add those).
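To make the __builtin_unreachable () assertion idea above concrete, a minimal
sketch; the baz signature and loop body here are my guesses (the real routine
isn't shown in this comment), only the assertion pattern is the point:

  #include <stdint.h>

  void
  baz (double *out1, double *out2, double *out3,
       const double *in1, const double *in2, int len)
  {
    int i;

    /* Assert 16-byte alignment of the store pointers: if any of the
       low 4 bits were set we would "reach" __builtin_unreachable (),
       so the compiler is entitled to assume the mask is zero and use
       aligned stores in the vectorized loop.  */
    if ((((uintptr_t) out1 | (uintptr_t) out2 | (uintptr_t) out3) & 15) != 0)
      __builtin_unreachable ();

    for (i = 0; i < len; i++)
      {
        /* Illustrative body only.  */
        out1[i] = in1[i] + in2[i];
        out2[i] = in1[i] - in2[i];
        out3[i] = in1[i] * in2[i];
      }
  }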
E.g. if this loop was hinted that all 5 pointers are 16-byte aligned and that
neither in1[0..len-1] nor in2[0..len-1] overlaps out{1,2,3}[0..len-1], the
vectorizer could verify those conditions at runtime and use a faster
vectorized loop that assumes correct alignment and __restrict, while the
fallback path (vectorization not beneficial, some overlap somewhere, or
misaligned pointers) would be a scalar loop assuming none of that. Or perhaps
the hints could tell the vectorizer to emit 3 different versions instead of
two, each with different assumptions, or something similar.
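For what that three-version scheme might look like, a hand-written sketch; the
baz_* helpers, the overlap test, and the signature are hypothetical, but the
alignment check mirrors the pointer-OR test from the r148210 dump above:

  #include <stdint.h>

  /* Hypothetical stand-ins for the three loop versions the vectorizer
     would generate; their bodies are not from the testcase.  */
  void baz_vec_aligned (double *, double *, double *,
                        const double *, const double *, int);
  void baz_vec_unaligned (double *, double *, double *,
                          const double *, const double *, int);
  void baz_scalar (double *, double *, double *,
                   const double *, const double *, int);

  /* Do the half-open ranges [p, p+len) and [q, q+len) intersect?  */
  static int
  overlaps (const double *p, const double *q, int len)
  {
    uintptr_t a = (uintptr_t) p, b = (uintptr_t) q;
    uintptr_t n = (uintptr_t) len * sizeof (double);
    return a < b + n && b < a + n;
  }

  void
  baz (double *out1, double *out2, double *out3,
       const double *in1, const double *in2, int len)
  {
    /* Same trick as the r148210 GIMPLE: OR all pointers, mask low bits.  */
    int aligned = ((((uintptr_t) out1 | (uintptr_t) out2 | (uintptr_t) out3
                     | (uintptr_t) in1 | (uintptr_t) in2) & 15) == 0);
    int disjoint = !overlaps (in1, out1, len) && !overlaps (in1, out2, len)
                   && !overlaps (in1, out3, len)
                   && !overlaps (in2, out1, len) && !overlaps (in2, out2, len)
                   && !overlaps (in2, out3, len);

    if (aligned && disjoint)
      baz_vec_aligned (out1, out2, out3, in1, in2, len);   /* movapd stores */
    else if (disjoint)
      baz_vec_unaligned (out1, out2, out3, in1, in2, len); /* movlpd+movhpd */
    else
      baz_scalar (out1, out2, out3, in1, in2, len);        /* no assumptions */
  }

The hints would then only bias which of these checks and versions the
vectorizer bothers to emit, rather than the user writing them by hand.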