Hi Ilya

OK, with 28 nodes and 4 GB/node,
you have much more memory than I thought.
The maximum N is calculated from the total memory across all nodes
(assuming the nodes are homogeneous, i.e., they all have the same RAM),
not from the memory of a single node.
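Just to put numbers on it (a rough estimate, assuming all 28 nodes
really do have 4 GB each, and using the rule of thumb from my earlier
message quoted below):

  total RAM ~ 28 * 4 GB = 112 GB
  max N ~ sqrt(0.8 * 112e9 / 8) ~ 105,000

So N=17920 is nowhere near the memory limit of the whole cluster.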

I haven't tried OpenMPI 1.3.3.
The last time I ran HPL it was with OpenMPI 1.3.2.
It worked fine.
It also worked with OpenMPI 1.3.1 and 1.3.0, but these versions
had a problem that caused memory leaks (at least over Infiniband; I'm not sure about Ethernet).
The problem was fixed in later OpenMPI versions (1.3.2 and newer).
In any case, there was even a command-line workaround for it
("-mca mpi_leave_pinned 0") in 1.3.0 and 1.3.1.
However, AFAIK, this workaround is not needed for 1.3.2 and newer.
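For reference, the workaround goes on the mpiexec line more or less
like this (the process count, hostfile name, and executable path are
just placeholders):

  mpiexec -np 56 --hostfile myhosts -mca mpi_leave_pinned 0 ./xhpl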

What is your OpenMPI mpiexec command line?

Is it possible that you somehow mixed a 32-bit machine/OpenMPI build
with a 64-bit machine/OpenMPI build?
For instance, your head node (where you compiled the code) is
64-bit, but the compute nodes - or some of them - are 32-bit,
or vice-versa?
The error messages you posted hint at something like that:
a mix of MPI_DOUBLE types, MPI_Aint types, etc.
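A quick way to check (assuming you can ssh to the nodes and that
"myhosts" is a plain list of host names - adjust to your setup):

  for h in $(cat myhosts); do ssh $h uname -m; done   # x86_64 vs. i686
  file ./xhpl     # reports whether the binary itself is 32- or 64-bit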

Also, make sure no mpi.h/mpif.h is hardwired into your HPL code,
or into the supporting libraries (BLAS/LAPACK, GotoBLAS, ATLAS, etc.).
Those include files are NOT portable across MPI flavors,
and they are a source of frustration when hardwired into the code.
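One way to check (the file names below are just guesses for a typical
HPL tree - adjust the arch suffix to yours) is to look at where the
HPL makefile picks up MPI, and at what the binary actually links to:

  grep -nE 'MPdir|MPinc|MPlib' Make.Linux_Intel
  ldd bin/Linux_Intel/xhpl | grep -i mpi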

Furthermore, make sure you don't have leftover HPL processes
from old runs hanging around on the compute nodes.
That is a common cause of trouble.
Could this be the reason for the problems you saw?
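Something along these lines (again assuming a plain hostfile called
"myhosts" and an executable called xhpl) would show any strays:

  for h in $(cat myhosts); do echo "== $h"; ssh $h 'ps -ef | grep [x]hpl'; done

and a "pkill xhpl" on the offending node cleans them up.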

Good luck.

I hope it helps.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

ilya zelenchuk wrote:
Hello, Gus!

Sorry for the lack of debug info.
I have 28 nodes. Each node has two 2.4 GHz Xeon processors and 4 GB of RAM.
OpenMPI 1.3.3 was compiled as:
CC=icc CFLAGS=" -O3" CXX=icpc CXXFLAGS=" -O3" F77=ifort FFLAGS=" -O3" \
FC=ifort FCFLAGS=" -O3" ./configure --prefix=/opt/openmpi/intel/ \
  --enable-debug --enable-mpi-threads --disable-ipv6

2009/12/28 Gus Correa <g...@ldeo.columbia.edu>:
Hi Ilya

Did you recompile HPL with OpenMPI, or did you just launch the MPICH2
executable with the OpenMPI mpiexec?
You probably know this, but you cannot mix different MPIs at
compile and run time.
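A quick sanity check (assuming the executable is called xhpl) is:

  ldd ./xhpl | grep -i mpi

It should list only your OpenMPI libraries, not MPICH2's.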
Yes, I know about this. I compiled HPL with OpenMPI and ran it with the same.
For MPICH2, I recompiled HPL.

Also, the HPL maximum problem size (N) depends on how much
memory/RAM you have.
If you make N too big, the arrays don't fit in RAM and you get into
memory paging, which is no good for MPI.
How much RAM do you have?
4 GB on each node. Also, I've watched the memory usage in top.
No swapping.

N=17920 would require about 2.6 GB for the matrix alone (N^2 x 8 bytes), if I did the math right.
A rule of thumb is maximum N = sqrt(0.8 * RAM_in_bytes / 8)
Have you tried smaller values (above 10000, but below 17920)?
For which N does it start to break?
With 8960 it works almost fine. I got the same errors just once; they
disappeared after rebooting the cluster :)
But if I set the problem size to 11200, the errors come back. At that point
rebooting doesn't help.

BTW, in the output I have:

===
type 11 count ints 82 count disp 81 count datatype 81
ints:     81 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 6481
MPI_Aint: 0 22400 44800 67200 89600 112000 134400 156800 179200 201600
224000 246400 268800 291200 313600 336000 358400 380800 403200 425600
448000 470400 492800 515200 537600 560000 582400 604800 627200 649600
672000 694400 716800 739200 761600 784000 806400 828800 851200 873600
896000 918400 940800 963200 985600 1008000 1030400 1052800 1075200
1097600 1120000 1142400 1164800 1187200 1209600 1232000 1254400
1276800 1299200 1321600 1344000 1366400 1388800 1411200 1433600
1456000 1478400 1500800 1523200 1545600 1568000 1590400 1612800
1635200 1657600 1680000 1702400 1724800 1747200 1769600 1343143936
types:    (81 * MPI_DOUBLE)
type 11 count ints 82 count disp 81 count datatype 81
ints:     81 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 6481
MPI_Aint: 0 22400 44800 67200 89600 112000 134400 156800 179200 201600
224000 246400 268800 291200 313600 336000 358400 380800 403200 425600
448000 470400 492800 515200 537600 560000 582400 604800 627200 649600
672000 694400 716800 739200 761600 784000 806400 828800 851200 873600
896000 918400 940800 963200 985600 1008000 1030400 1052800 1075200
1097600 1120000 1142400 1164800 1187200 1209600 1232000 1254400
1276800 1299200 1321600 1344000 1366400 1388800 1411200 1433600
1456000 1478400 1500800 1523200 1545600 1568000 1590400 1612800
1635200 1657600 1680000 1702400 1724800 1747200 1769600 1343143936
types:    (81 * MPI_DOUBLE)
...
===

Interesting, but it seems that HPL runs just fine, only with these
warning messages in stdout and stderr.
Also, I've run HPL with OpenMPI 1.4 - no warnings and no errors.

The HPL TUNING file may help:
http://www.netlib.org/benchmark/hpl/tuning.html
Yes, it's a good one!

Good luck.

My two cents,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

ilya zelenchuk wrote:
Good day, everyone!

I have a problem running the HPL benchmark with OpenMPI 1.3.3.
When the problem size (Ns) is smaller than 10000, all is good. But when I set Ns
to 17920 (for example), I get these errors:

===
[ums1:05086] ../../ompi/datatype/datatype_pack.h:37
       Pointer 0xb27752c0 size 4032 is outside [0xb27752c0,0x10aeac8] for
       base ptr 0xb27752c0 count 1 and data
[ums1:05086] Datatype 0x83a0618[] size 5735048 align 4 id 0 length 244
used 81
true_lb 0 true_ub 1318295560 (true_extent 1318295560) lb 0 ub
1318295560 (extent 1318295560)
nbElems 716881 loops 0 flags 102 (commited )-c-----GD--[---][---]
  contain MPI_DOUBLE
--C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x0 (0) extent 8
(size 71040)
--C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x11800 (71680)
extent 8 (size 71040)
--C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x23000 (143360)
extent 8 (size 71040)
--C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x34800 (215040)
extent 8 (size 71040)
--C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x46000 (286720)
extent 8 (size 71040)
--C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x57800 (358400)
extent 8 (size 71040)
--C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x69000 (430080)
extent 8 (size 71040)
--C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x7a800 (501760)
extent 8 (size 71040)
--C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x8c000 (573440)
extent 8 (size 71040)
--C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x9d800 (645120)
extent 8 (size 71040)
--C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0xaf000 (716800)
extent 8 (size 71040)
--C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0xc0800 (788480)
extent 8 (size 71040)
--C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0xd2000 (860160)
extent 8 (size 71040)
--C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0xe3800 (931840)
extent 8 (size 71040)
--C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0xf5000 (1003520)
extent 8 (size 71040)
--C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x106800
(1075200) extent 8 (size 71040)
--C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x118000
(1146880) extent 8 (size 71040)
--C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x129800
(1218560) extent 8 (size 71040)
....
===

Here is my HPL.dat:

===
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
17920        Ns
1            # of NBs
80           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
14           Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
2            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
===

I've run HPL with this same HPL.dat using MPICH2 - it works well.