http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49442
--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-06-16 16:33:16 UTC ---
I was testing on SandyBridge, but it was reported to us for Core2. The loop
used to be vectorized in 4.4 and still is now; in both cases the compiler
emits a huge, hard-to-decipher runtime test with many conditions and then
uses either the non-vectorized loop or the vectorized one. In r148210 the
condition also included:

  vect_p.44_29 = (vector double *) out1_6(D);
  addr2int0.45_28 = (long int) vect_p.44_29;
  vect_p.48_37 = (vector double *) out2_21(D);
  addr2int1.49_38 = (long int) vect_p.48_37;
  orptrs1.50_40 = addr2int0.45_28 | addr2int1.49_38;
  vect_p.53_41 = (vector double *) out3_34(D);
  addr2int2.54_42 = (long int) vect_p.53_41;
  orptrs2.55_51 = orptrs1.50_40 | addr2int2.54_42;
  andmask.56_52 = orptrs2.55_51 & 15;
  ...
  D.2833_72 = andmask.56_52 == 0;

(i.e. it ORed the three store pointers together and tested the low 4 bits, so
a single compare verifies 16-byte alignment of all of them), but the new
condition does not include it. Previously the vectorized loop used movapd
stores:

  movapd %xmm0, (%rdi,%r10)
  ...
  movapd %xmm0, (%rsi,%r10)
  ...
  movapd %xmm0, (%rdx,%r10)

while now it uses:

  movlpd %xmm0, (%rdi,%rbx)
  movhpd %xmm0, 8(%rdi,%rbx)
  ...
  movlpd %xmm0, (%rsi,%rbx)
  movhpd %xmm0, 8(%rsi,%rbx)
  ...
  movlpd %xmm0, (%rdx,%rbx)
  movhpd %xmm0, 8(%rdx,%rbx)

Surprisingly, the new code is slower even when the pointers aren't aligned:

r148210:
Strip out best and worst realtime result
minimum: 8.849950347 sec real / 0.000085810 sec CPU
maximum: 9.278652529 sec real / 0.000153471 sec CPU
average: 9.055898562 sec real / 0.000138755 sec CPU
stdev : 0.073603342 sec real / 0.000016469 sec CPU

r148211:
Strip out best and worst realtime result
minimum: 12.089365836 sec real / 0.000081233 sec CPU
maximum: 12.378188295 sec real / 0.000158253 sec CPU
average: 12.234883839 sec real / 0.000136920 sec CPU
stdev : 0.073461527 sec real / 0.000017463 sec CPU

(same baz routine, but with

  double a[60000] __attribute__((aligned (32)));

  int
  main ()
  {
    int i;
    for (i = 0; i < 500000; i++)
      baz (a + 1, a + 10001, a + 30000, a + 40000, a + 50000, 10000);
    return 0;
  }

used instead). Here the r148210-generated code uses the scalar loop, while the
r148211 code uses the vectorized one with those movlpd+movhpd stores. So in
this particular case, for this particular CPU, it would be better if the cost
model decided to verify at runtime whether all store pointers are sufficiently
aligned and to use the vectorized loop only in that case.

BTW, the vectorization condition is really long; is it a good idea to let it
all go through with just a single branch at the end? Wouldn't it be better to
test the checks most likely to fail first, do a conditional branch, then some
other tests, then another conditional branch?

I've talked with Richard on IRC about how users could promise the compiler
that the pointers are sufficiently aligned, so that it can just assume they
are aligned (instead of testing for it) and use that in the loop, both for
loads and stores. Possibilities include __attribute__((ptr_align (align
[, misalign]))) on const pointer parameters and const pointer variables, or
adding assertions using __builtin_unreachable (). But now that I think about
it more, we already version the loop for vectorization in this case, so
wouldn't it be better to just add some extension which would allow the user to
say something is likely? Such a hint could say, e.g., that some pointer is
likely aligned (or misaligned) in a particular way, or that pointers don't
overlap (yeah, I know, we have restrict, but e.g. on STL containers it is more
fun to add those).
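To make the __builtin_unreachable () assertion idea above concrete, a minimal
sketch; the baz signature and loop body here are my guesses (the real routine
isn't shown in this comment), only the assertion pattern is the point:

  #include <stdint.h>

  void
  baz (double *out1, double *out2, double *out3,
       const double *in1, const double *in2, int len)
  {
    int i;

    /* Assert 16-byte alignment of the store pointers: if any of the
       low 4 bits were set we would "reach" __builtin_unreachable (),
       so the compiler is entitled to assume the mask is zero and use
       aligned stores in the vectorized loop.  */
    if ((((uintptr_t) out1 | (uintptr_t) out2 | (uintptr_t) out3) & 15) != 0)
      __builtin_unreachable ();

    for (i = 0; i < len; i++)
      {
        /* Illustrative body only.  */
        out1[i] = in1[i] + in2[i];
        out2[i] = in1[i] - in2[i];
        out3[i] = in1[i] * in2[i];
      }
  }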
E.g. if this loop was hinted that all 5 pointers are 16-byte aligned and that
neither in1[0..len-1] nor in2[0..len-1] overlaps out{1,2,3}[0..len-1], the
vectorizer could verify those conditions at runtime and use a faster
vectorized loop that assumes correct alignment and __restrict, while the
fallback path (vectorization not beneficial, some overlap somewhere, or
misaligned pointers) would be a scalar loop assuming none of that. Or perhaps
the hints could tell the vectorizer to emit 3 different versions instead of
two, each with different assumptions, or something similar.
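For what that three-version scheme might look like, a hand-written sketch; the
baz_* helpers, the overlap test, and the signature are hypothetical, but the
alignment check mirrors the pointer-OR test from the r148210 dump above:

  #include <stdint.h>

  /* Hypothetical stand-ins for the three loop versions the vectorizer
     would generate; their bodies are not from the testcase.  */
  void baz_vec_aligned (double *, double *, double *,
                        const double *, const double *, int);
  void baz_vec_unaligned (double *, double *, double *,
                          const double *, const double *, int);
  void baz_scalar (double *, double *, double *,
                   const double *, const double *, int);

  /* Do the half-open ranges [p, p+len) and [q, q+len) intersect?  */
  static int
  overlaps (const double *p, const double *q, int len)
  {
    uintptr_t a = (uintptr_t) p, b = (uintptr_t) q;
    uintptr_t n = (uintptr_t) len * sizeof (double);
    return a < b + n && b < a + n;
  }

  void
  baz (double *out1, double *out2, double *out3,
       const double *in1, const double *in2, int len)
  {
    /* Same trick as the r148210 GIMPLE: OR all pointers, mask low bits.  */
    int aligned = ((((uintptr_t) out1 | (uintptr_t) out2 | (uintptr_t) out3
                     | (uintptr_t) in1 | (uintptr_t) in2) & 15) == 0);
    int disjoint = !overlaps (in1, out1, len) && !overlaps (in1, out2, len)
                   && !overlaps (in1, out3, len)
                   && !overlaps (in2, out1, len) && !overlaps (in2, out2, len)
                   && !overlaps (in2, out3, len);

    if (aligned && disjoint)
      baz_vec_aligned (out1, out2, out3, in1, in2, len);   /* movapd stores */
    else if (disjoint)
      baz_vec_unaligned (out1, out2, out3, in1, in2, len); /* movlpd+movhpd */
    else
      baz_scalar (out1, out2, out3, in1, in2, len);        /* no assumptions */
  }

The hints would then only bias which of these checks and versions the
vectorizer bothers to emit, rather than the user writing them by hand.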