Dear all,

I am having problems running large jobs on a PC cluster with Open MPI v1.3.
Typically the error appears only at processor counts >= 2048 (cores, to be precise), though sometimes also below that.

The nodes (Intel Nehalem, 2 processors with 4 cores each) run Scientific Linux (I believe):
$> uname -a
Linux cl3fr1 2.6.18-128.1.10.el5 #1 SMP Thu May 7 12:48:13 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

The code starts normally, reads its input data sets (~4 GB), does some initialization, and then proceeds to the actual calculations. Some time after that, it fails with the following error message:

[n100501][[40339,1],6][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory

Memory usage by the application itself should not be the problem: at this process count, the code uses only ~100 MB per process. The code also runs fine at lower process counts, where it consumes more memory per process.
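For what it is worth, here is a back-of-envelope estimate of InfiniBand queue-pair (QP) consumption at this scale. It assumes roughly 4 QPs per remote peer (which I believe is what the default btl_openib_receive_queues spec in Open MPI 1.3 implies, one QP per queue spec) and 8 ranks per node; both numbers are my assumptions, not something I have verified:

```shell
# Rough QP-count estimate for a fully connected 2048-rank job.
# Assumptions (mine): 4 QPs per remote peer, 8 ranks per node.
ranks=2048
per_node=8
qps_per_peer=4
peers=$(( ranks - per_node ))               # remote peers seen by each rank
qps_per_rank=$(( peers * qps_per_peer ))    # QPs one rank may create
qps_per_node=$(( qps_per_rank * per_node )) # QPs the HCA on one node must hold
echo "$qps_per_rank QPs per rank, $qps_per_node QPs per node"
```

If numbers in that ballpark exceed what the HCA driver is configured to allocate, qp_create_one failing with "Cannot allocate memory" at >= 2048 cores (but not below) would be consistent.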


I also get the following, apparently secondary, error messages:

[n100501:14587] [[40339,0],0]-[[40339,1],4] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)


The cluster uses InfiniBand interconnects. The only non-default parameter settings I am aware of (systemwide) are:
btl_openib_ib_min_rnr_timer = 25
btl_openib_ib_timeout = 20
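In case it helps anyone reproduce or work around this: since (as I understand it) each queue spec in btl_openib_receive_queues creates one QP per connected peer, overriding the spec to use fewer queues, e.g. a single shared receive queue, should reduce per-node QP usage. A sketch only; I have not verified this on our system, the queue sizes are illustrative placeholders, and "my_app" stands for the actual executable:

```shell
# Untested sketch: single shared receive queue ("S" spec) instead of the
# default multi-queue spec; sizes are placeholders, not recommendations.
mpiexec --mca btl_openib_receive_queues S,65536,256,128,32 ./my_app
```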

$> ulimit -l
unlimited


I have attached:
1) the output of $> ompi_info --all (ompi_info.log)
2) the stderr from PBS (stderr.log)


Thanks for any help you may give!

Cheers,
Jose

Attachment: ompi_info.log.gz
Description: GNU Zip compressed data

+ export OMP_NUM_THREADS=1
+ OMP_NUM_THREADS=1
+ module load compiler/intel mpi/openmpi/1.3-intel-11.0
++ /opt/system/modules/3.2.6/Modules/3.2.6/bin/modulecmd bash load 
compiler/intel mpi/openmpi/1.3-intel-11.0
+ eval 
LD_LIBRARY_PATH=/opt/mpi/openmpi/1.3-intel-11.0/lib:/usr/local/lib:/opt/compiler/intel//cc/11.0.074/idb/lib/intel64:/opt/compiler/intel//fc/11.0.074/lib/intel64:/opt/compiler/intel//cc/11.0.074/lib/intel64
 ';export' 
'LD_LIBRARY_PATH;LOADEDMODULES=system/maui/3.2.6p21:compiler/intel/11.0:mpi/openmpi/1.3-intel-11.0'
 ';export' 
'LOADEDMODULES;MANPATH=/usr/local/man::/opt/system/modules/default/man:/opt/compiler/intel//cc/11.0.074/man:/opt/compiler/intel//fc/11.0.074/man:/opt/mpi/openmpi/1.3-intel-11.0/man'
 ';export' 'MANPATH;MPIDIR=/opt/mpi/openmpi/1.3-intel-11.0' ';export' 
'MPIDIR;MPI_BIN_DIR=/opt/mpi/openmpi/1.3-intel-11.0/bin' ';export' 
'MPI_BIN_DIR;MPI_INC_DIR=/opt/mpi/openmpi/1.3-intel-11.0/include' ';export' 
'MPI_INC_DIR;MPI_LIB_DIR=/opt/mpi/openmpi/1.3-intel-11.0/lib' ';export' 
'MPI_LIB_DIR;MPI_MAN_DIR=/opt/mpi/openmpi/1.3-intel-11.0/man' ';export' 
'MPI_MAN_DIR;MPI_VERSION=1.3-intel-11.0' ';export' 
'MPI_VERSION;NLSPATH=/opt/compiler/intel//cc/11.0.074/idb/intel64/locale/%l_%t/%N'
 ';export' 
'NLSPATH;PATH=/opt/mpi/openmpi/1.3-intel-11.0/bin:/opt/compiler/intel//fc/11.0.074/bin/intel64:/opt/compiler/intel//java/jre1.6.0_14/bin:/opt/compiler/intel//cc/11.0.074/bin/intel64:/nfs/home4/HLRS/hlrs/hpcjgrac/bin:/usr/local/bin:/usr/lib64/qt-3.3/bin:/opt/system/maui/3.2.6p21/bin:/usr/kerberos/bin:/bin:/usr/bin'
 ';export' 
'PATH;_LMFILES_=/opt/system/modulefiles/system/maui/3.2.6p21:/opt/modulefiles/compiler/intel/11.0:/opt/modulefiles/mpi/openmpi/1.3-intel-11.0'
 ';export' '_LMFILES_;'
++ 
LD_LIBRARY_PATH=/opt/mpi/openmpi/1.3-intel-11.0/lib:/usr/local/lib:/opt/compiler/intel//cc/11.0.074/idb/lib/intel64:/opt/compiler/intel//fc/11.0.074/lib/intel64:/opt/compiler/intel//cc/11.0.074/lib/intel64
++ export LD_LIBRARY_PATH
++ 
LOADEDMODULES=system/maui/3.2.6p21:compiler/intel/11.0:mpi/openmpi/1.3-intel-11.0
++ export LOADEDMODULES
++ 
MANPATH=/usr/local/man::/opt/system/modules/default/man:/opt/compiler/intel//cc/11.0.074/man:/opt/compiler/intel//fc/11.0.074/man:/opt/mpi/openmpi/1.3-intel-11.0/man
++ export MANPATH
++ MPIDIR=/opt/mpi/openmpi/1.3-intel-11.0
++ export MPIDIR
++ MPI_BIN_DIR=/opt/mpi/openmpi/1.3-intel-11.0/bin
++ export MPI_BIN_DIR
++ MPI_INC_DIR=/opt/mpi/openmpi/1.3-intel-11.0/include
++ export MPI_INC_DIR
++ MPI_LIB_DIR=/opt/mpi/openmpi/1.3-intel-11.0/lib
++ export MPI_LIB_DIR
++ MPI_MAN_DIR=/opt/mpi/openmpi/1.3-intel-11.0/man
++ export MPI_MAN_DIR
++ MPI_VERSION=1.3-intel-11.0
++ export MPI_VERSION
++ NLSPATH=/opt/compiler/intel//cc/11.0.074/idb/intel64/locale/%l_%t/%N
++ export NLSPATH
++ 
PATH=/opt/mpi/openmpi/1.3-intel-11.0/bin:/opt/compiler/intel//fc/11.0.074/bin/intel64:/opt/compiler/intel//java/jre1.6.0_14/bin:/opt/compiler/intel//cc/11.0.074/bin/intel64:/nfs/home4/HLRS/hlrs/hpcjgrac/bin:/usr/local/bin:/usr/lib64/qt-3.3/bin:/opt/system/maui/3.2.6p21/bin:/usr/kerberos/bin:/bin:/usr/bin
++ export PATH
++ 
_LMFILES_=/opt/system/modulefiles/system/maui/3.2.6p21:/opt/modulefiles/compiler/intel/11.0:/opt/modulefiles/mpi/openmpi/1.3-intel-11.0
++ export _LMFILES_
+ module list
++ /opt/system/modules/3.2.6/Modules/3.2.6/bin/modulecmd bash list
Currently Loaded Modulefiles:
  1) system/maui/3.2.6p21         3) mpi/openmpi/1.3-intel-11.0
  2) compiler/intel/11.0
+ eval
+ cd 
/nfs/nas/homeB/home4/HLRS/hlrs/hpcjgrac/prace/benchmark/applications/gadget/tmp/GADGET_NEHALEM-HLRS_StrongScaling_2048_i000083/n256p8t1_t001_i01
++ date
+ echo '<jobstart at="Fri Jun 19 09:50:05 CEST 2009" />'
+ mpiexec time 
/nfs/nas/homeB/home4/HLRS/hlrs/hpcjgrac/prace/benchmark/applications/gadget/tmp/GADGET_NEHALEM-HLRS_StrongScaling_2048_i000083/n256p8t1_t001_i01/GADGET_NEHALEM-HLRS_cname_NEHALEM-HLRS.exe
 param.txt
[n100501][[40339,1],6][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n100501][[40339,1],5][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n100501][[40339,1],5][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb]
 error in endpoint reply start connect
[n100501][[40339,1],1][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n100501][[40339,1],1][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb]
 error in endpoint reply start connect
[n100501][[40339,1],2][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n100501][[40339,1],2][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n100501][[40339,1],2][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb]
 error in endpoint reply start connect
[n100501][[40339,1],3][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n100501][[40339,1],3][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb]
 error in endpoint reply start connect
[n100501][[40339,1],4][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n100501][[40339,1],4][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb]
 error in endpoint reply start connect
[n100501][[40339,1],6][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n100501][[40339,1],6][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb]
 error in endpoint reply start connect
[n100501][[40339,1],7][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n100501][[40339,1],7][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb]
 error in endpoint reply start connect
[n100501:14587] [[40339,0],0]-[[40339,1],4] mca_oob_tcp_msg_recv: readv failed: 
Connection reset by peer (104)
[n100501:14587] [[40339,0],0]-[[40339,1],7] mca_oob_tcp_msg_recv: readv failed: 
Connection reset by peer (104)
[n100501:14587] [[40339,0],0]-[[40339,1],6] mca_oob_tcp_msg_recv: readv failed: 
Connection reset by peer (104)
[n100501:14587] [[40339,0],0]-[[40339,1],5] mca_oob_tcp_msg_recv: readv failed: 
Connection reset by peer (104)
[n100501:14587] [[40339,0],0]-[[40339,1],1] mca_oob_tcp_msg_recv: readv failed: 
Connection reset by peer (104)
[n100501:14587] [[40339,0],0]-[[40339,1],2] mca_oob_tcp_msg_recv: readv failed: 
Connection reset by peer (104)
[n100501:14587] [[40339,0],0]-[[40339,1],3] mca_oob_tcp_msg_recv: readv failed: 
Connection reset by peer (104)
[n033201][[40339,1],1551][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n033201][[40339,1],1551][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb]
 error in endpoint reply start connect
[n033201][[40339,1],1547][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n033201:3588] *** An error occurred in MPI_Sendrecv
[n033201:3588] *** on communicator MPI_COMM_WORLD
[n033201:3588] *** MPI_ERR_OTHER: known error not in list
[n033201:3588] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[n033102][[40339,1],1538][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n033102][[40339,1],1543][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n033201][[40339,1],1549][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n033201][[40339,1],1545][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n033201][[40339,1],1545][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb]
 error in endpoint reply start connect
[n033102][[40339,1],1540][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n033102][[40339,1],1540][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb]
 error in endpoint reply start connect
[n033102][[40339,1],1541][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n033102][[40339,1],1536][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n033201][[40339,1],1544][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n033201][[40339,1],1550][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n033201][[40339,1],1550][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb]
 error in endpoint reply start connect
[n033201][[40339,1],1548][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n033201][[40339,1],1548][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb]
 error in endpoint reply start connect
[n033201][[40339,1],1546][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n033201][[40339,1],1546][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb]
 error in endpoint reply start connect
[n033202][[40339,1],1553][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n033202][[40339,1],1555][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n033202][[40339,1],1555][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb]
 error in endpoint reply start connect
[n033202][[40339,1],1556][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n033202][[40339,1],1552][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n033202][[40339,1],1552][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb]
 error in endpoint reply start connect
[n033202][[40339,1],1558][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n033202][[40339,1],1559][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n033202][[40339,1],1557][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
 error creating qp errno says Cannot allocate memory
[n033201:03576] [[40339,0],193]-[[40339,1],1544] mca_oob_tcp_msg_recv: readv 
failed: Connection reset by peer (104)
[n033102:03498] [[40339,0],192]-[[40339,1],1538] mca_oob_tcp_msg_recv: readv 
failed: Connection reset by peer (104)
[n033102:03498] [[40339,0],192]-[[40339,1],1543] mca_oob_tcp_msg_recv: readv 
failed: Connection reset by peer (104)
[n033201:03576] [[40339,0],193]-[[40339,1],1551] mca_oob_tcp_msg_recv: readv 
failed: Connection reset by peer (104)
[n033102:03498] [[40339,0],192]-[[40339,1],1540] mca_oob_tcp_msg_recv: readv 
failed: Connection reset by peer (104)
[n033201:03576] [[40339,0],193]-[[40339,1],1549] mca_oob_tcp_msg_recv: readv 
failed: Connection reset by peer (104)
[n033202:03719] [[40339,0],194]-[[40339,1],1555] mca_oob_tcp_msg_recv: readv 
failed: Connection reset by peer (104)
[n033202:03719] [[40339,0],194]-[[40339,1],1552] mca_oob_tcp_msg_recv: readv 
failed: Connection reset by peer (104)
Command exited with non-zero status 16
64.36user 3.48system 1:20.39elapsed 84%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (7major+125286minor)pagefaults 0swaps
--------------------------------------------------------------------------
mpiexec has exited due to process rank 1538 with PID 3501 on
node n033102 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
[n100501:14587] 11 more processes have sent help message help-mpi-errors.txt / 
mpi_errors_are_fatal
[n100501:14587] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
help / error messages
++ date
+ echo '<jobend at="Fri Jun 19 09:51:27 CEST 2009" />'
