Re: FRE may run out of memory

2014-02-08 Thread Andrew Pinski
On Fri, Feb 7, 2014 at 11:29 PM, dxq  wrote:
> hi all,
>
> We found that gcc would run out of memory on Windows when compiling a *big*
> function (10 lines).

My suggestion to you is to file a bug at http://gcc.gnu.org/bugzilla with
the preprocessed source, along with the exact version of GCC you have
tried.  There have been some improvements for extreme testcases,
at least on the trunk of the GCC sources.

Thanks,
Andrew Pinski

>
> More investigation shows that gcc crashes in the function *compute_avail*,
> in the tree-fre pass.  *compute_avail* collects information from basic blocks,
> so memory is allocated to record that information.
> However, if there is a huge number of basic blocks, the memory is
> exhausted and gcc crashes, especially on Windows PCs, which generally have
> only 2G or 4G of memory. On Linux it's OK, and *compute_avail* allocates *2.4G*
> of memory. I guess some optimization passes in gcc, like FRE, didn't consider
> this extreme case.
>
> When the tree-fre pass is disabled, gcc crashes in the IRA pass instead.  I
> will do more investigation on that.
>
> Any suggestions?
>
> Thanks!
>
> danxiaoqiang
>
>
>
> --
> View this message in context: 
> http://gcc.1065356.n5.nabble.com/FRE-may-run-out-of-memory-tp1009578.html
> Sent from the gcc - patches mailing list archive at Nabble.com.


Re: -O3 and -ftree-vectorize

2014-02-08 Thread Tim Prince


On 2/7/2014 11:09 AM, Tim Prince wrote:

On 02/07/2014 10:22 AM, Jakub Jelinek wrote:

On Thu, Feb 06, 2014 at 05:21:00PM -0500, Tim Prince wrote:

I'm seeing vectorization but no output from
-ftree-vectorizer-verbose, and no dot product vectorization inside
omp parallel regions, with gcc, g++, or gfortran 4.9.  Primary targets
are cygwin64 and linux x86_64.
I've been unable to use -O3 vectorization with gcc, although it
works with gfortran and g++, so I use gcc -O2 -ftree-vectorize
together with additional optimization flags which don't break.
Can you file a GCC bugzilla PR with minimal testcases for this (or point us
at already filed bugreports)?
The question of problems with gcc -O3 (called from gfortran) has
eluded me as to finding a minimal test case.  When I run under debug,
it appears that somewhere prior to the crash some gfortran code is
over-written with data by the gcc code, which overwhelms my debugging
skill.  I can get full performance with -O2 plus a bunch of
intermediate flags.
As to non-vectorization of dot product in an omp parallel region,
-fopt-info (which I didn't know about) is reporting vectorization, but
there are no parallel simd instructions in the generated code for the
omp_fn.  I'll file a PR on that if it's still reproduced in a minimal
case.

I've made source code changes to take advantage of the new
vectorization with merge() and ? operators; while it's useful for
-march=core-avx2, it's sometimes a loss for -msse4.1.
gcc vectorization with #pragma omp parallel for simd is reasonably
effective in my tests only on 12 or more cores.

Likewise.
Those are cases of 2 levels of loops from the netlib "vector" benchmark
where only one level is vectorizable and parallelizable.  By putting
the vectorizable loop on the outside, the parallelization scales to a
large number of cores.  I don't expect it to out-perform single-thread
optimized avx vectorization until 8 or more cores are in use, but it
needs more than the expected number of threads even relative to SSE
vectorization.

#pragma omp simd reduction(max: ) is giving correct results but poor
performance in my tests.

Likewise.
I'll file a PR on this; I didn't know if there might be interest.  I
have an Intel compiler issue "closed, will not be fixed", so the simd
reduction(max: ) isn't viable for icc in the near term.

Thanks,

With further investigation, my case with reverse_copy outside and
inner_product inside an omp parallel region is working very well with
-O3 -ffast-math for the double data type.  There seems to be a possible
performance problem with reverse_copy for the float data type, so much so
that gfortran does better with the loop reversal pushed down into the
parallel dot_products.  I have seen at least 2 cases where the new gcc
vectorization of stride -1 with vpermd is superior to other compilers,
even for the float data type.
For the cases where omp parallel for simd is set in expectation of
gaining outer-loop parallel simd, gcc is ignoring the simd clause.  So it
is understandable that a large number of cores is needed to overcome the
lack of parallel simd (other than by simd intrinsics coding).

I'll choose an example of omp simd reduction(max: ) for a PR.
Thanks.

--
Tim Prince



gcc-4.7-20140208 is now available

2014-02-08 Thread gccadmin
Snapshot gcc-4.7-20140208 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/4.7-20140208/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.7 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_7-branch 
revision 207638

You'll find:

 gcc-4.7-20140208.tar.bz2 Complete GCC

  MD5=f432b3c20df7055b4bf91b8128c8b1a9
  SHA1=2f780b48b328f395a98847a74bb13fd21aecfe2d

Diffs from 4.7-20140201 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.7
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.