Hey!

2009/12/29 Gus Correa <g...@ldeo.columbia.edu>:
> Hi Ilya
>
> OK, with 28 nodes and 4GB/node,
> you have much more memory than I thought.
> The maximum N is calculated based on the total memory
> you have (assuming the nodes are homogeneous, have the same RAM),
> not based on the memory per node.

Yep, I know. I was playing with HPL when these errors came up.
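[For reference, a small standalone C sketch, not from the thread itself, that applies the rule of thumb Gus quotes further down, max N = sqrt(0.8 * RAM_in_bytes / 8), to the 28 nodes x 4 GB reported here. The file name and build line are only examples.]

===
/* Rule-of-thumb maximum HPL problem size from total cluster RAM,
 * using the formula quoted later in this thread:
 *     N_max ~ sqrt(0.8 * RAM_in_bytes / 8)
 * The 28 nodes x 4 GB below are the values reported in this thread.
 * Build with e.g.: gcc maxn.c -o maxn -lm
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double nodes        = 28.0;
    const double ram_per_node = 4.0 * 1024.0 * 1024.0 * 1024.0; /* 4 GB in bytes */
    const double total_bytes  = nodes * ram_per_node;
    const double n_max        = sqrt(0.8 * total_bytes / 8.0);  /* 8 bytes per double */

    printf("total RAM : ~%.0f GB\n", total_bytes / (1024.0 * 1024.0 * 1024.0));
    printf("max HPL N : ~%.0f\n", n_max);
    return 0;
}
===

[With those numbers it prints a maximum N of roughly 110000, so N=17920 is nowhere near the total-memory limit.]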
> I haven't tried OpenMPI 1.3.3.
> Last I ran HPL was with OpenMPI 1.3.2.
> It worked fine.
> It also worked with OpenMPI 1.3.1 and 1.3.0, but these versions
> had a problem that caused memory leaks (at least on Infiniband, not sure
> about Ethernet).
> The problem was fixed in later OpenMPI versions (1.3.2 and newer).
> In any case, there was even a workaround in the command line for that
> ("-mca mpi_leave_pinned 0") for 1.3.0 and 1.3.1.
> However, AFAIK, this workaround is not needed for 1.3.2 and newer.
> What is your OpenMPI mpiexec command line?

$MPIRUN_HOME/mpirun --prefix $MPIR_HOME --hostfile $mf -np $1 $progname $cmdLineArgs

where:
  MPIR_HOME=/opt/openmpi/intel
  MPIRUN_HOME=$MPIR_HOME/bin
  $mf          - generated automatically by our task scheduler
  $1           - number of processes requested by the user
  $progname    - full path to the binary
  $cmdLineArgs - ARGV

> Is it possible that you somehow mixed a 32-bit machine/OpenMPI build
> with a 64-bit machine/OpenMPI build?
> For instance, your head node (where you compiled the code) is
> 64-bit, but the compute nodes - or some of them - are 32-bit,
> or vice-versa?
> The error messages you posted hint at something like that,
> a mix of MPI_DOUBLE types, MPI_Aint types, etc.

No, all nodes (master and workers) are Xeon 2.4 GHz, 32-bit.

> Also, make sure no mpi.h/mpif.h is hardwired into your HPL code,
> or into the supporting libraries BLAS/LAPACK, or Goto BLAS,
> or ATLAS, etc.
> Those include files are NOT portable across MPI flavors,
> and a source of frustration when hardwired into the code.

OK.

> Furthermore, make sure you don't have leftover HPL processes
> from old runs hanging on the compute nodes.
> That is a common cause of trouble.

No, there are no other tasks on the nodes.

> Would this be the reason for the problems you saw?

:) I have the same OpenMPI on another cluster and it works fine there.
Maybe something went wrong when I compiled OpenMPI? I need some time to check that.

> Good luck.
>
> I hope it helps.
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
> ilya zelenchuk wrote:
>>
>> Hello, Gus!
>>
>> Sorry for the lack of debug info.
>> I have 28 nodes. Each node has 2 Xeon 2.4 GHz processors with 4 GB of RAM.
>> OpenMPI 1.3.3 was compiled as:
>> CC=icc CFLAGS=" -O3" CXX=icpc CXXFLAGS=" -O3" F77=ifort FFLAGS=" -O3"
>> FC=ifort FCFLAGS=" -O3" ./configure --prefix=/opt/openmpi/intel/
>> --enable-debug --enable-mpi-threads --disable-ipv6
>>
>> 2009/12/28 Gus Correa <g...@ldeo.columbia.edu>:
>>>
>>> Hi Ilya
>>>
>>> Did you recompile HPL with OpenMPI, or just launched the MPICH2
>>> executable with the OpenMPI mpiexec?
>>> You probably know this, but you cannot mix different MPIs at
>>> compile and run time.
>>
>> Yes, I know about this. I compiled HPL with OpenMPI and ran it with
>> the same OpenMPI. For MPICH2, I recompiled HPL.
>>> Also, the HPL maximum problem size (N) depends on how much
>>> memory/RAM you have.
>>> If you make N too big, the arrays don't fit in the RAM,
>>> you get into memory paging, which is no good for MPI.
>>> How much RAM do you have?
>>
>> 4 GB on each node. Also, I've watched the memory usage in top.
>> No swap.
>>> N=17920 would require about 3.2GB, if I did the math right.
>>> A rule of thumb is maximum N = sqrt(0.8 * RAM_in_bytes / 8)
>>> Have you tried smaller values (above 10000, but below 17920)?
>>> For which N does it start to break?
>>
>> With 8960 it works almost fine: I got the same errors just once, and they
>> disappeared after rebooting the cluster :)
>> But if I set the problem size to 11200, the errors come back. At that point
>> rebooting doesn't help.
>>
>> BTW, in the output I have:
>>
>> ===
>> type 11 count ints 82 count disp 81 count datatype 81
>> ints: 81 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
>> 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
>> 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
>> 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
>> 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
>> 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 6481
>> MPI_Aint: 0 22400 44800 67200 89600 112000 134400 156800 179200 201600
>> 224000 246400 268800 291200 313600 336000 358400 380800 403200 425600
>> 448000 470400 492800 515200 537600 560000 582400 604800 627200 649600
>> 672000 694400 716800 739200 761600 784000 806400 828800 851200 873600
>> 896000 918400 940800 963200 985600 1008000 1030400 1052800 1075200
>> 1097600 1120000 1142400 1164800 1187200 1209600 1232000 1254400
>> 1276800 1299200 1321600 1344000 1366400 1388800 1411200 1433600
>> 1456000 1478400 1500800 1523200 1545600 1568000 1590400 1612800
>> 1635200 1657600 1680000 1702400 1724800 1747200 1769600 1343143936
>> types: (81 * MPI_DOUBLE)
>> type 11 count ints 82 count disp 81 count datatype 81
>> ints: 81 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
>> 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
>> 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
>> 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
>> 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
>> 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 6481
>> MPI_Aint: 0 22400 44800 67200 89600 112000 134400 156800 179200 201600
>> 224000 246400 268800 291200 313600 336000 358400 380800 403200 425600
>> 448000 470400 492800 515200 537600 560000 582400 604800 627200 649600
>> 672000 694400 716800 739200 761600 784000 806400 828800 851200 873600
>> 896000 918400 940800 963200 985600 1008000 1030400 1052800 1075200
>> 1097600 1120000 1142400 1164800 1187200 1209600 1232000 1254400
>> 1276800 1299200 1321600 1344000 1366400 1388800 1411200 1433600
>> 1456000 1478400 1500800 1523200 1545600 1568000 1590400 1612800
>> 1635200 1657600 1680000 1702400 1724800 1747200 1769600 1343143936
>> types: (81 * MPI_DOUBLE)
>> ...
>> ===
>>
>> Interestingly, HPL itself seems to run just fine, just with these
>> warning messages in stdout and stderr.
>> Also, I've run HPL with OpenMPI 1.4 - no warnings and no errors.
>>
>>> The HPL TUNING file may help:
>>> http://www.netlib.org/benchmark/hpl/tuning.html
>>
>> Yes, it's a good one!
>>
>>> Good luck.
>>>
>>> My two cents,
>>> Gus Correa
>>> ---------------------------------------------------------------------
>>> Gustavo Correa
>>> Lamont-Doherty Earth Observatory - Columbia University
>>> Palisades, NY, 10964-8000 - USA
>>> ---------------------------------------------------------------------
>>>
>>> ilya zelenchuk wrote:
>>>>
>>>> Good day, everyone!
>>>>
>>>> I have a problem while running the HPL benchmark with OpenMPI 1.3.3.
>>>> When the problem size (Ns) is smaller than 10000, all is good.
>>>> But when I set Ns to 17920 (for example), I get errors:
>>>>
>>>> ===
>>>> [ums1:05086] ../../ompi/datatype/datatype_pack.h:37
>>>> Pointer 0xb27752c0 size 4032 is outside [0xb27752c0,0x10aeac8] for
>>>> base ptr 0xb27752c0 count 1 and data
>>>> [ums1:05086] Datatype 0x83a0618[] size 5735048 align 4 id 0 length 244
>>>> used 81
>>>> true_lb 0 true_ub 1318295560 (true_extent 1318295560) lb 0 ub
>>>> 1318295560 (extent 1318295560)
>>>> nbElems 716881 loops 0 flags 102 (commited )-c-----GD--[---][---]
>>>> contain MPI_DOUBLE
>>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x0 (0) extent 8
>>>> (size 71040)
>>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x11800 (71680)
>>>> extent 8 (size 71040)
>>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x23000 (143360)
>>>> extent 8 (size 71040)
>>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x34800 (215040)
>>>> extent 8 (size 71040)
>>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x46000 (286720)
>>>> extent 8 (size 71040)
>>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x57800 (358400)
>>>> extent 8 (size 71040)
>>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x69000 (430080)
>>>> extent 8 (size 71040)
>>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x7a800 (501760)
>>>> extent 8 (size 71040)
>>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x8c000 (573440)
>>>> extent 8 (size 71040)
>>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x9d800 (645120)
>>>> extent 8 (size 71040)
>>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0xaf000 (716800)
>>>> extent 8 (size 71040)
>>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0xc0800 (788480)
>>>> extent 8 (size 71040)
>>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0xd2000 (860160)
>>>> extent 8 (size 71040)
>>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0xe3800 (931840)
>>>> extent 8 (size 71040)
>>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0xf5000 (1003520)
>>>> extent 8 (size 71040)
>>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x106800
>>>> (1075200) extent 8 (size 71040)
>>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x118000
>>>> (1146880) extent 8 (size 71040)
>>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x129800
>>>> (1218560) extent 8 (size 71040)
>>>> ....
>>>> ===
>>>>
>>>> Here is my HPL.dat:
>>>>
>>>> ===
>>>> HPLinpack benchmark input file
>>>> Innovative Computing Laboratory, University of Tennessee
>>>> HPL.out      output file name (if any)
>>>> 6            device out (6=stdout,7=stderr,file)
>>>> 1            # of problems sizes (N)
>>>> 17920        Ns
>>>> 1            # of NBs
>>>> 80           NBs
>>>> 0            PMAP process mapping (0=Row-,1=Column-major)
>>>> 1            # of process grids (P x Q)
>>>> 2            Ps
>>>> 14           Qs
>>>> 16.0         threshold
>>>> 1            # of panel fact
>>>> 2            PFACTs (0=left, 1=Crout, 2=Right)
>>>> 1            # of recursive stopping criterium
>>>> 4            NBMINs (>= 1)
>>>> 1            # of panels in recursion
>>>> 2            NDIVs
>>>> 1            # of recursive panel fact.
>>>> 2            RFACTs (0=left, 1=Crout, 2=Right)
>>>> 1            # of broadcast
>>>> 2            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
>>>> 1            # of lookahead depth
>>>> 1            DEPTHs (>=0)
>>>> 2            SWAP (0=bin-exch,1=long,2=mix)
>>>> 64           swapping threshold
>>>> 0            L1 in (0=transposed,1=no-transposed) form
>>>> 0            U  in (0=transposed,1=no-transposed) form
>>>> 1            Equilibration (0=no,1=yes)
>>>> 8            memory alignment in double (> 0)
>>>> ===
>>>>
>>>> I've run HPL with this HPL.dat using MPICH2 - it works well.
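[As a side note, the datatype dump quoted above describes 81 blocks of MPI_DOUBLE: 80 blocks of 2800 elements at a 22400-byte stride, plus a trailing block of 6481 elements. Below is a minimal, hypothetical MPI sketch, not HPL's actual code, that builds and sends a similar derived datatype; it uses a regular stride for the last block instead of the much larger final displacement shown in the dump. The file name and run line are just examples.]

===
/* Hypothetical sanity check, not HPL's code: build an hindexed datatype
 * with a layout similar to the dump above (80 blocks of 2800 doubles at
 * a 22400-byte stride, plus one trailing block of 6481 doubles) and send
 * it from rank 0 to rank 1.
 * Build/run with e.g.: mpicc dtcheck.c -o dtcheck && mpirun -np 2 ./dtcheck
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBLK    81
#define BLKLEN  2800
#define LASTLEN 6481
#define STRIDE  22400            /* bytes = 2800 doubles * 8 */

int main(int argc, char **argv)
{
    int rank, i;
    int blens[NBLK];
    MPI_Aint disps[NBLK];
    MPI_Datatype ltype;
    size_t nbytes;
    double *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < NBLK; i++) {
        blens[i] = (i == NBLK - 1) ? LASTLEN : BLKLEN;
        disps[i] = (MPI_Aint)i * STRIDE;  /* regular stride, unlike the dump's last entry */
    }

    MPI_Type_create_hindexed(NBLK, blens, disps, MPI_DOUBLE, &ltype);
    MPI_Type_commit(&ltype);

    /* One buffer big enough to hold the whole layout. */
    nbytes = (size_t)(NBLK - 1) * STRIDE + (size_t)LASTLEN * sizeof(double);
    buf = malloc(nbytes);
    memset(buf, 0, nbytes);

    if (rank == 0) {
        MPI_Send(buf, 1, ltype, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 1, ltype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("hindexed message received OK\n");
    }

    free(buf);
    MPI_Type_free(&ltype);
    MPI_Finalize();
    return 0;
}
===

[If a plain test like this already triggers the same datatype warnings, that would point at the Open MPI build rather than at HPL or the HPL.dat settings.]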