For the following code:

---------------------------
#include <stdint.h>

uint8_t data[16];
static __attribute__((noinline)) void test(unsigned i)
{
        unsigned j;
        for (j = 0; j < 16; j++)
                data[j] = (i + j) >> 8;
}
---------------------------

With gcc 4.4 and 4.5, the code generated with -O3 -ftree-vectorize is ~25%
slower than the code generated with -O3 -fno-tree-vectorize. Versions 4.3 and
older don't vectorize this code.

Command line:
gcc tst2a.c -o tst2.o -O3 -march=k8 -fno-tree-vectorize
gcc tst2a.c -o tst2.o -O3 -march=k8 -ftree-vectorize
(using -m32 -fomit-frame-pointer has no significant effect on performance)
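Both variants can also be built into a single binary, which makes it easy to
check that they store the same bytes. The following is only a sketch and not
part of the attached test: it assumes GCC's per-function optimize attribute
(documented by GCC as intended mainly for debugging), and the names OPT,
test_vec, test_novec and variants_agree are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* The optimize attribute is GCC-specific; other compilers get an empty macro. */
#if defined(__GNUC__) && !defined(__clang__)
#define OPT(x) __attribute__((optimize(x)))
#else
#define OPT(x)
#endif

uint8_t data[16];

/* Hypothetical name: same loop as test(), vectorization requested. */
__attribute__((noinline)) OPT("tree-vectorize")
void test_vec(unsigned i)
{
        unsigned j;
        for (j = 0; j < 16; j++)
                data[j] = (i + j) >> 8;
}

/* Hypothetical name: same loop as test(), vectorization disabled. */
__attribute__((noinline)) OPT("no-tree-vectorize")
void test_novec(unsigned i)
{
        unsigned j;
        for (j = 0; j < 16; j++)
                data[j] = (i + j) >> 8;
}

/* Returns 1 if both variants store identical bytes for input i. */
int variants_agree(unsigned i)
{
        uint8_t expect[16];

        test_novec(i);
        memcpy(expect, data, sizeof expect);
        memset(data, 0xAA, sizeof data);   /* poison so stale bytes can't match */
        test_vec(i);
        return memcmp(expect, data, sizeof expect) == 0;
}
```

Whether the attribute really changes code generation for a given function
should be confirmed in the assembly output (gcc -S); the two separate builds
above remain the reference setup.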

Tested versions (average time in ticks, 1<<24 loops):
3.4.6 (gentoo) - 66 ticks; very slow, probably doesn't unroll the loop (I
haven't looked at the generated code)
4.1.2 - 4.3.3 (gentoo) - 20 ticks; doesn't autovectorize even when
-ftree-vectorize is specified
4.4.0 (gentoo) - 20 ticks without vectorizing, 30 ticks with
4.5.0 (r149701) - 19 ticks without vectorizing, 24 ticks with; non-vectorized
code is faster by 1 tick with -march=k8 than with -march=barcelona (even though
my CPU is a Barcelona)

(I am reporting this only against 4.5.0 since I don't have vanilla 4.4.0 and
older)
Tests were repeated several times, run with highest priority and with affinity
set to one core.
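A measurement of this kind could be sketched as follows. This is not the
attached test program: read_ticks and average_ticks are hypothetical names,
dividing the elapsed ticks by the number of calls is an assumption about how
the average was formed, and the counter falls back to nanoseconds from
clock_gettime on non-x86 targets. Raising the priority and pinning the process
to one core are left to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

uint8_t data[16];

static __attribute__((noinline)) void test(unsigned i)
{
        unsigned j;
        for (j = 0; j < 16; j++)
                data[j] = (i + j) >> 8;
}

/* Tick source: the TSC on x86, otherwise monotonic time in nanoseconds. */
static uint64_t read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
        return __rdtsc();
#else
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* Average ticks per call of test() over `loops` calls (loops must be > 0). */
double average_ticks(unsigned loops)
{
        uint64_t start, end;
        unsigned i;

        start = read_ticks();
        for (i = 0; i < loops; i++)
                test(i);
        end = read_ticks();
        return (double)(end - start) / loops;
}
```

Calling average_ticks(1u << 24) corresponds to the loop count used above. The
result includes the overhead of the loop and of the call itself, so only
differences between builds are meaningful, not the absolute value.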

CPU is AMD Phenom (4 cores, Barcelona) running at fixed 1400MHz.

The complete test program is attached.


-- 
           Summary: generated code is ~25% slower when autovectorization is
                    enabled
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: zsojka at seznam dot cz
  GCC host triplet: x86_64-pc-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40771
