Hi Jacob

Thank you very much for the suggestions and insight.

On an idle node, MemFree is about 15599152 kB (14.8 GB).
Applying the "80%" rule to that figure, I get a problem size of N=38,440.
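
For reference, here is the arithmetic written out as a small C sketch
(just my own back-of-the-envelope, not anything from HPL; the exact N
depends on the units and on rounding down to a multiple of NB, so it
does not land exactly on 38,440):

/* Back-of-the-envelope HPL problem size from MemFree.
 * Assumes an N x N matrix of doubles (8 bytes each) and
 * the "80% of free memory" rule of thumb.
 * Compile with -lm for sqrt().                           */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double memfree_kb = 15599152.0;      /* from /proc/meminfo        */
    double usable     = 0.80 * memfree_kb * 1024.0;
    int    nb         = 128;             /* block size to round to    */
    int    n          = (int)sqrt(usable / 8.0);

    n = (n / nb) * nb;                   /* round down to multiple of NB */
    printf("N ~ %d\n", n);
    return 0;
}
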
However, the HPL run still fails with the memory-leak problem
even at N=35,000,
whenever openib is among the MCA btl parameters.
You may have seen another message by Brian Barrett explaining a
possible reason for the problem and suggesting a workaround.
I haven't tried it yet, but I will.

I read about HPL's preference for "square" PxQ process grids.
On a single node the fastest runs use 2x4,
but 1x8 is often competitive as well, coming in second or third,
even though it is not "square" at all.
I would guess this has much to do with
the physical 2-socket, 4-cores-per-socket layout. Is that right?
I would also guess that the best process grid is likely to
be quite different when the whole cluster is used, right?
Can the 2x4 grid that is fastest on a single node be used
to infer the fastest process grid for the whole cluster?
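
In case it clarifies what I mean by candidate grids, here is a small
sketch (nothing HPL-specific, just the factor pairs with P <= Q) that
lists the PxQ grids for a given process count; for the cluster, np
would be the total number of cores rather than 8:

/* List the P x Q factor pairs of np with P <= Q, i.e. the
 * candidate HPL process grids for np MPI processes.       */
#include <stdio.h>

int main(void)
{
    int np = 8;   /* one node here; nodes * 8 for the whole cluster */
    int p;

    for (p = 1; p * p <= np; p++)
        if (np % p == 0)
            printf("P = %d   Q = %d\n", p, np / p);
    return 0;
}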

The best I got so far was 80% efficiency, less than your "at least 85%".
So, I certainly have more work to do.

GotoBLAS was compiled with the GNU compilers, with no special
optimization flags beyond what the distribution Makefiles already set.
OpenMPI was also compiled with the GNU compilers, but with CFLAGS and
FFLAGS set to:

-march=amdfam10 -O3 -finline-functions -funroll-loops -mfpmath=sse

Since I used mpicc and mpif77 to compile HPL, I presume it inherited
these flags as well. Is that right?
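
(If it is of any use, I believe the flags the wrapper compilers add can
be inspected with something like

    mpicc -showme:compile
    mpicc -showme:link

though I have not checked how complete that picture is for 1.3.1.)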

However, I have already read comments on other mailing lists
that "-march=amdfam10" is not really the best choice for
Barcelona (and I wonder whether it is for Shanghai),
even though gcc describes it as tailored to that architecture.
Which "-march" value is actually the fastest?

Any suggestions in this area of compilers and optimization?

Many thanks,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

jacob_liber...@dell.com wrote:
Hi Gus,

For single-node runs, don't bother specifying the btl.  Open MPI should
select the best option.

Beyond that, the "80% of total RAM" recommendation is misleading.
Base your N on MemFree rather than MemTotal; IB can reserve quite a bit.
Also verify that your /etc/security/limits.conf settings allow
sufficient memory locking.  (Try unlimited.)
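For example, something along these lines (unlimited is just the
simplest way to rule locking limits out; adjust to your site policy):

*   soft   memlock   unlimited
*   hard   memlock   unlimited
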
Finally, P should be smaller than Q, and squarer values are recommended.

With Shanghai, Open MPI, and GotoBLAS, expect single-node efficiency of
at least 85% given decent tuning.  If the distribution continues to look
strange, there are more things to check.

Thanks, Jacob

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Gus Correa
Sent: Friday, May 01, 2009 12:17 PM
To: Open MPI Users
Subject: [OMPI users] HPL with OpenMPI: Do I have a memory leak?

Hi OpenMPI and HPC experts

This may or may not be the right forum to post this,
and I am sorry to bother those that think it is not.

I am trying to run the HPL benchmark on our cluster,
compiling it with Gnu and linking to
GotoBLAS (1.26) and OpenMPI (1.3.1),
both also Gnu-compiled.

I have got failures that suggest a memory leak when the
problem size is large, but still within the memory limits
recommended by HPL.
The problem only happens when "openib" is among the OpenMPI
MCA parameters (and the problem size is large).
Any help is appreciated.

Here is a description of what happens.

For starters I am trying HPL on a single node, to get a feeling for
the right parameters (N & NB, P & Q, etc.) on a dual-socket quad-core
AMD Opteron 2376 "Shanghai" node.

The HPL recommendation is to use close to 80% of your physical memory
to reach top Gflop/s performance.
Our physical memory per node is 16GB, and this gives a problem size of
N=40,000 to stay at about 80% memory use.
I tried several block sizes, somewhat correlated to the size of the
processor cache:  NB=64 80 96 128 ...

When I run HPL with N=20,000 or smaller all works fine,
and the HPL run completes, regardless of whether "openib"
is present or not on my MCA parameters.

However, when I move to N=40,000, or even N=35,000,
the run starts fine with NB=64,
but as NB is switched to larger values
the total memory use increases in jumps (as shown by Ganglia),
and becomes uneven across the processors (as shown by "top").
The problem happens if "openib" is among the MCA parameters,
but doesn't happen if I remove "openib" from the MCA list and use
only "sm,self".

For N=35,000, by the time NB reaches 96 the memory use is already above
the physical limit (16GB), having increased from 12.5GB to over 17GB.
For N=40,000 the problem happens even earlier, with NB=80.
At this point memory swapping kicks in,
and eventually the run dies with memory allocation errors:

============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR01L2L4       35000   128     8     1             539.66          5.297e+01
----------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0043992 ...... PASSED
HPL ERROR from process # 0, on line 172 of function HPL_pdtest:
 >>> [7,0] Memory allocation failed for A, x and b. Skip. <<<
...

***

Here is the code snippet from HPL_pdtest.c that the error message
refers to, although the leak is probably somewhere else:

/*
 * Allocate dynamic memory
 */
    vptr = (void*)malloc( ( (size_t)(ALGO->align) +
                            (size_t)(mat.ld+1) * (size_t)(mat.nq) ) *
                          sizeof(double) );
    info[0] = (vptr == NULL); info[1] = myrow; info[2] = mycol;
    (void) HPL_all_reduce( (void *)(info), 3, HPL_INT, HPL_max,
                           GRID->all_comm );
    if( info[0] != 0 )
    {
       if( ( myrow == 0 ) && ( mycol == 0 ) )
          HPL_pwarn( TEST->outfp, __LINE__, "HPL_pdtest",
                     "[%d,%d] %s", info[1], info[2],
                     "Memory allocation failed for A, x and b. Skip." );
       (TEST->kskip)++;
       return;
    }

***

I found this continued increase in memory use rather strange,
and suggestive of a memory leak in one of the codes being used.
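
For scale, the matrix itself should only need roughly N^2 x 8 bytes in
total.  A quick back-of-the-envelope sketch of that arithmetic (my own
numbers, nothing taken from HPL):

/* Expected HPL matrix footprint, to compare against what
 * Ganglia and top report: N^2 doubles, split across np ranks. */
#include <stdio.h>

int main(void)
{
    double n  = 35000.0;
    int    np = 8;
    double total_gb = n * n * 8.0 / (1024.0 * 1024.0 * 1024.0);

    printf("matrix total ~ %.1f GB, ~ %.2f GB per rank\n",
           total_gb, total_gb / np);
    return 0;
}

For N=35,000 this is only about 9.1 GB in total, which seems hard to
reconcile with growth to over 17 GB from the matrix alone.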

Everything (OpenMPI, GotoBLAS, and HPL)
was compiled using Gnu only (gcc, gfortran, g++).

I haven't changed anything on the compiler's memory model,
i.e., I haven't used or changed the "-mcmodel" flag of gcc
(I don't know if the Makefiles on HPL, GotoBLAS, and OpenMPI use it.)

No additional load is present on the node,
other than the OS (Linux CentOS 5.2), HPL is running alone.

The cluster has Infiniband.
However, I am running on a single node.

The surprising thing is that if I run on shared memory only
(-mca btl sm,self) there is no memory problem,
the memory use is stable at about 13.9GB,
and the run completes.
So there is a workaround for running on a single node.
(Shared memory is presumably the way to go on a single node anyway.)

However, if I introduce IB (-mca btl openib,sm,self)
among the MCA btl parameters, then memory use blows up.

This is bad news for me, because I want to extend the experiment
to run HPL also across the whole cluster using IB,
which is actually the ultimate goal of HPL, of course!
It also suggests that the problem is somehow related to Infiniband,
maybe hidden under OpenMPI.

Here is the mpiexec command I use (with and without openib):

/path/to/openmpi/bin/mpiexec \
         -prefix /the/run/directory \
         -np 8 \
         -mca btl [openib,]sm,self \
         xhpl
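
(If I understand the MCA selection syntax correctly, the shared-memory
runs could equivalently be requested by excluding openib, e.g.

/path/to/openmpi/bin/mpiexec \
         -prefix /the/run/directory \
         -np 8 \
         -mca btl ^openib \
         xhpl

but so far I have simply listed the btls explicitly, as above.)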


Any help, insights, suggestions, reports of previous experiences,
are much appreciated.

Thank you,
Gus Correa
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

