For the following code:
---------------------------
#include <stdint.h>

uint8_t data[16];

/* Store bits 8..15 of (i + j) into each byte of data[]. */
static __attribute__((noinline)) void test(unsigned i)
{
  unsigned j;
  for (j = 0; j < 16; j++)
    data[j] = (i + j) >> 8;
}
---------------------------
code generated with -O3 -ftree-vectorize is ~25% slower than with -O3 -fno-tree-vectorize for gcc 4.4 and 4.5. 4.3 and older don't vectorize this code.

Command line:
gcc tst2a.c -o tst2.o -O3 -march=k8 -fno-tree-vectorize
gcc tst2a.c -o tst2.o -O3 -march=k8 -ftree-vectorize
(using -m32 -fomit-frame-pointer has no significant effect on performance)

Tested versions (average time in ticks, 1<<24 loops):
3.4.6 (gentoo) - (66 ticks) very slow, probably doesn't unroll the loop (I haven't looked at the code)
4.1.2 - 4.3.3 (gentoo) - (20 ticks) doesn't autovectorize even when -ftree-vectorize is specified
4.4.0 (gentoo) - (20 without vectorizing, 30 with)
4.5.0 (r149701) - (19 ticks / 24 ticks) non-vectorized code is faster by 1 tick with -march=k8 than with -march=barcelona (even when my arch is barcelona)
(I am reporting this only against 4.5.0 since I don't have vanilla 4.4.0 and older)

Tests were repeated several times, run with highest priority and with affinity set to one core.
CPU is AMD Phenom (4 cores, Barcelona) running at fixed 1400MHz.

Attached is code including whole test code.

--
           Summary: generated code is ~25% slower when autovectorization is enabled
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: zsojka at seznam dot cz
  GCC host triplet: x86_64-pc-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40771
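
A minimal sketch of how a measurement like the one described above could be set up. This is not the attached test code: time_test is a hypothetical helper, and it uses the standard clock() function rather than the tick counter behind the reported numbers, so its output is in nanoseconds and will not match the figures in this report.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

uint8_t data[16];

/* The loop from the report: store bits 8..15 of (i + j) into data[j]. */
static __attribute__((noinline)) void test(unsigned i)
{
  unsigned j;
  for (j = 0; j < 16; j++)
    data[j] = (i + j) >> 8;
}

/* Hypothetical helper (not from the report): call test() `loops` times
   and return the average CPU time per call in nanoseconds, as measured
   by clock(). */
static double time_test(unsigned long loops)
{
  unsigned long n;
  clock_t start, end;

  assert(loops > 0);
  start = clock();
  for (n = 0; n < loops; n++)
    test((unsigned)n);
  end = clock();

  return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)loops;
}
```

Calling time_test(1UL << 24) from main() and printing the result, once for a binary built with -ftree-vectorize and once with -fno-tree-vectorize, would give a rough comparison of the two builds. The resolution of clock() is coarse, so only the average over many calls is meaningful.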