On 02/07/2014 10:22 AM, Jakub Jelinek wrote:
On Thu, Feb 06, 2014 at 05:21:00PM -0500, Tim Prince wrote:
I'm seeing vectorization but no output from
-ftree-vectorizer-verbose, and no dot product vectorization inside
omp parallel regions, with gcc, g++, or gfortran 4.9. Primary targets
are cygwin64 and linux x86_64.
I've been unable to use -O3 vectorization with gcc, although it
works with gfortran and g++, so I use gcc -O2 -ftree-vectorize
together with additional optimization flags which don't break anything.
Can you file a GCC Bugzilla PR with minimal test cases for this (or point us
at already filed bug reports)?
I have not managed to reduce the problems with gcc -O3 (called from
gfortran) to a minimal test case. When I run under a debugger, it
appears that somewhere prior to the crash some gfortran code is
overwritten with data by the gcc code, which is beyond my debugging
skill. I can get full performance with -O2 plus a bunch of intermediate
flags.
As to non-vectorization of the dot product in an omp parallel region,
-fopt-info (which I didn't know about) reports vectorization, but
there are no parallel simd instructions in the generated code for the
omp_fn. I'll file a PR on that if it still reproduces in a minimal case.
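
A minimal sketch of what such a test case might look like, not the actual
benchmark code. The function name dot and the file name dot.c are
hypothetical. Compiling with something like
gcc -O2 -ftree-vectorize -ffast-math -fopenmp -fopt-info-vec -S dot.c
and looking at the outlined function whose name contains omp_fn should show
whether packed multiply/add instructions are really emitted. As far as I
know, GCC only vectorizes a floating-point sum reduction when reassociation
is allowed, hence -ffast-math here.

```c
/* Dot product with the sum reduction inside an OpenMP parallel region.
   With -fopenmp, GCC outlines the loop body into a separate function;
   without -fopenmp the pragma is ignored and the loop runs serially. */
float dot(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    int i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

Removing the pragma and comparing the assembly of the two builds is one way
to confirm that the vectorizer report and the generated code disagree only
for the parallel version.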
I've made source code changes to take advantage of the new
vectorization with merge() and ? operators; while it's useful for
-march=core-avx2, it's sometimes a loss for -msse4.1.
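
To illustrate the kind of source change meant here (an assumed example, not
taken from the benchmark; both function names are hypothetical): a
conditional store written with if can be rewritten with the ? operator so
that every iteration stores a value. The Fortran analogue would be
c = merge(a, c, a > b).

```c
/* Branch form: the store to c[i] happens only when the test is true. */
void keep_larger_branch(float *c, const float *a, const float *b, int n)
{
    int i;
    for (i = 0; i < n; i++)
        if (a[i] > b[i])
            c[i] = a[i];
}

/* Select form: every iteration stores, and the ? operator picks either
   the new value or the old one. The result is the same as above. */
void keep_larger_select(float *c, const float *a, const float *b, int n)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = (a[i] > b[i]) ? a[i] : c[i];
}
```

The select form does more loads and stores per iteration than the branch
form, which may be one reason such a rewrite can lose on targets where the
vectorized select is not cheap.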
gcc vectorization with #pragma omp parallel for simd is reasonably
effective in my tests only on 12 or more cores.
Likewise.
Those are cases of 2 levels of loops from the netlib "vector" benchmark
where only one level is vectorizable and parallelizable. By putting the
vectorizable loop on the outside, the parallelization scales to a large
number of cores. I don't expect it to outperform single-thread
optimized AVX vectorization until 8 or more cores are in use, but it
needs more threads than expected, even relative to SSE
vectorization.
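
A sketch of the loop shape described above, using a hypothetical kernel
rather than the netlib source. The function name prefix_rows and the
row-major layout are assumptions. The inner j loop carries a dependence
from one row to the next, so only the outer i loop can be vectorized and
parallelized.

```c
/* a and b are nj-by-ni arrays stored row by row, so consecutive i values
   are adjacent in memory. Each column i is independent of the others;
   within a column, row j depends on row j-1. */
void prefix_rows(float *a, const float *b, int ni, int nj)
{
    int i, j;

    #pragma omp parallel for simd private(j)
    for (i = 0; i < ni; i++)
        for (j = 1; j < nj; j++)
            a[j * ni + i] = a[(j - 1) * ni + i] + b[j * ni + i];
}
```

Without -fopenmp the pragma is ignored and the code runs serially with the
same result, which gives a single-thread baseline for the scaling
comparison.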
#pragma omp simd reduction(max: ) is giving correct results but poor
performance in my tests.
Likewise.
I'll file a PR on this; I didn't know if there might be interest. I have
an Intel compiler issue "closed, will not be fixed", so the simd
reduction(max: ) isn't viable for icc in the near term.
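
For reference, a minimal sketch of the construct in question. The function
name vmax is hypothetical, and this is not the benchmark loop. It assumes
n >= 1.

```c
/* Maximum of x[0..n-1] using an OpenMP simd max reduction.
   Without -fopenmp the pragma is ignored and the loop is plain C. */
float vmax(const float *x, int n)
{
    float m = x[0];
    int i;

    #pragma omp simd reduction(max:m)
    for (i = 1; i < n; i++)
        if (x[i] > m)
            m = x[i];

    return m;
}
```

Timing this against the same loop without the pragma, at the same
optimization flags, would be one way to show the performance gap in a PR.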
Thanks,