Re: -O3 and -ftree-vectorize

Tim Prince Fri, 07 Feb 2014 08:10:18 -0800


On 02/07/2014 10:22 AM, Jakub Jelinek wrote:

On Thu, Feb 06, 2014 at 05:21:00PM -0500, Tim Prince wrote:

I'm seeing vectorization but no output from
-ftree-vectorizer-verbose, and no dot product vectorization inside
omp parallel regions, with gcc g++ or gfortran 4.9.  Primary targets
are cygwin64 and linux x86_64.
I've been unable to use -O3 vectorization with gcc, although it
works with gfortran and g++, so use gcc -O2 -ftree-vectorize
together with additional optimization flags which don't break.

Can you file a GCC bugzilla PR with minimal testcases for this (or point us
at already filed bugreports)?

I have not been able to find a minimal test case for the problems with gcc -O3 (called from gfortran). When I run under the debugger, it appears that somewhere prior to the crash some gfortran code is overwritten with data by the gcc code, which overwhelms my debugging skill. I can get full performance with -O2 plus a bunch of intermediate flags.

As to non-vectorization of dot product in omp parallel region, -fopt-info (which I didn't know about) is reporting vectorization, but there are no parallel simd instructions in the generated code for the omp_fn. I'll file a PR on that if it's still reproduced in a minimal case.
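For reference, the kind of dot product loop in question might look like the sketch below. This is not the actual test code; the function name dot and its signature are made up for illustration.

```c
/* Hypothetical reduced case: a dot product whose iterations are shared
   among threads, with each thread's chunk also marked for SIMD.
   Without -fopenmp the pragma is ignored and the loop runs serially. */
double dot(const double *a, const double *b, long n)
{
    double sum = 0.0;
    #pragma omp parallel for simd reduction(+:sum)
    for (long i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

Compiling with something like gcc -O2 -ftree-vectorize -fopenmp -fopt-info-vec -S and reading the assembly of the outlined omp_fn is one way to compare what the report says against the instructions actually generated.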

I've made source code changes to take advantage of the new
vectorization with merge() and ? operators; while it's useful for
-march=core-avx2, it's sometimes a loss for -msse4.1.
gcc vectorization with #pragma omp parallel for simd is reasonably
effective in my tests only on 12 or more cores.

Likewise.

Those are cases of 2 levels of loops from netlib "vector" benchmark where only one level is vectorizable and parallelizable. By putting the vectorizable loop on the outside the parallelization scales to a large number of cores. I don't expect it to out-perform single thread optimized avx vectorization until 8 or more cores are in use, but it needs more than expected number of threads even relative to SSE vectorization.
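A rough sketch of that loop arrangement, not the netlib code itself: the array sizes, the function name sweep, and the recurrence are all assumptions chosen only to show an inner loop with a carried dependence and an outer loop whose iterations are independent.

```c
#define M 16   /* hypothetical length of the dependent (inner) loop   */
#define N 64   /* hypothetical length of the independent (outer) loop */

/* The j loop carries a recurrence (row j needs row j-1), so only the i
   loop can be parallelized and vectorized. Each i touches its own
   column, and a[j][i] is unit stride in i. */
void sweep(float a[M][N], float b[M][N])
{
    #pragma omp parallel for simd
    for (int i = 0; i < N; ++i)
        for (int j = 1; j < M; ++j)
            a[j][i] = a[j - 1][i] + b[j][i];
}
```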

#pragma omp simd reduction(max: ) is giving correct results but poor
performance in my tests.

Likewise.

I'll file a PR on this, didn't know if there might be interest. I have an Intel compiler issue "closed, will not be fixed" so the simd reduction(max: ) isn't viable for icc in the near term.
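The max reduction pattern is along these lines. Again a minimal sketch rather than the benchmark source; the name max_val is hypothetical, and it assumes n is at least 1.

```c
/* Hypothetical reduced case: maximum of an array using an OpenMP 4.0
   simd reduction. m starts at x[0]; the reduction combines that
   original value with the per-lane maxima at the end of the loop. */
float max_val(const float *x, int n)
{
    float m = x[0];
    #pragma omp simd reduction(max:m)
    for (int i = 1; i < n; ++i)
        m = x[i] > m ? x[i] : m;
    return m;
}
```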

Thanks,
