Please help me resolve the following problem with a 306-node Rocks cluster
using SGE. Note that I can run the job successfully on fewer than 87 slots, but
not on any more than that.
We're running SGE, and I'm submitting my jobs via the SGE CLI utility qsub
using a script that contains the following lines:
mpirun -n $NSLOTS -machinefile $TMPDIR/machines \
    --mca btl_tcp_if_include eth0 --mca btl_tcp_if_exclude eth1 \
    --mca oob_tcp_if_exclude eth1 --mca opal_set_max_sys_limits 1 \
    --mca pls_gridengine_verbose 1 \
    /stf/billstst/ProcessColors2MPICH1
echo "MPICH1 mpirun returned $?"
eth1 is the connection to the Isilon NAS, where the object file is located.
The error messages returned are of the form:
WRT ORTE_ERROR_LOG: The system limit on number of pipes a process can open was
reached
WRT ORTE_ERROR_LOG: The system limit on number of network connections a process
can open was reached in file oob_tcp.c at line 447
We have increased the open file limit from 1024 to 4096, but the problem still
exists.
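For reference, this is roughly how we check the limit and the descriptors
actually in use; the limits.conf entries below are our assumption about the
right place to raise it cluster-wide, and daemons that SGE starts at boot may
need the limit raised in the sge_execd startup environment instead:

    # On a compute node: per-process open file limit (was 1024, now 4096)
    ulimit -n

    # While the job is running: count descriptors held by the Open MPI daemons
    for pid in $(pgrep -f orted); do
        echo "$pid: $(ls /proc/$pid/fd | wc -l) open fds"
    done

    # Raised cluster-wide in /etc/security/limits.conf:
    #   *  soft  nofile  4096
    #   *  hard  nofile  4096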
I can run the same test code via MPICH2 successfully on all 696 slots of the
cluster, but I can't run the same code (compiled with Open MPI version 1.3.3)
on any more than 86 slots.
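In other words, the same source is built twice, roughly like this; the source
file name and the MPICH2 install path are placeholders, not our exact paths:

    # Open MPI 1.3.3 build (fails above 86 slots)
    /opt/openmpi/bin/mpicc -o ProcessColors2MPICH1 ProcessColors2.c

    # MPICH2 build (runs on all 696 slots)
    /opt/mpich2/gnu/bin/mpicc -o ProcessColors2_mpich2 ProcessColors2.c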
Here are the details on the installed version of Open MPI:
[root]# ./ompi_info
Package: Open MPI [email protected] Distribution
Open MPI: 1.3.3
Open MPI SVN revision: r21666
Open MPI release date: Jul 14, 2009
Open RTE: 1.3.3
Open RTE SVN revision: r21666
Open RTE release date: Jul 14, 2009
OPAL: 1.3.3
OPAL SVN revision: r21666
OPAL release date: Jul 14, 2009
Ident string: 1.3.3
Prefix: /opt/openmpi
Configured architecture: x86_64-unknown-linux-gnu
Configure host: build-x86-64.rocksclusters.org
Configured by: root
Configured on: Sat Dec 12 16:29:23 PST 2009
Configure host: build-x86-64.rocksclusters.org
Built by: bruno
Built on: Sat Dec 12 16:42:52 PST 2009
Built host: build-x86-64.rocksclusters.org
C bindings: yes
C++ bindings: yes
Fortran77 bindings: yes (all)
Fortran90 bindings: yes
Fortran90 bindings size: small
C compiler: gcc
C compiler absolute: /usr/bin/gcc
C++ compiler: g++
C++ compiler absolute: /usr/bin/g++
Fortran77 compiler: gfortran
Fortran77 compiler abs: /usr/bin/gfortran
Fortran90 compiler: gfortran
Fortran90 compiler abs: /usr/bin/gfortran
C profiling: yes
C++ profiling: yes
Fortran77 profiling: yes
Fortran90 profiling: yes
C++ exceptions: no
Thread support: posix (mpi: no, progress: no)
Sparse Groups: no
Internal debug support: no
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
libltdl support: yes
Heterogeneous support: no
mpirun default --prefix: no
MPI I/O support: yes
MPI_WTIME support: gettimeofday
Symbol visibility support: yes
FT Checkpoint support: no (checkpoint thread: no)
MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.3.3)
MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.3)
MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3.3)
MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.3.3)
MCA carto: file (MCA v2.0, API v2.0, Component v1.3.3)
MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.3.3)
MCA timer: linux (MCA v2.0, API v2.0, Component v1.3.3)
MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.3)
MCA installdirs: config (MCA v2.0, API v2.0, Component v1.3.3)
MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.3)
MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.3)
MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.3)
MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.3.3)
MCA coll: basic (MCA v2.0, API v2.0, Component v1.3.3)
MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.3.3)
MCA coll: inter (MCA v2.0, API v2.0, Component v1.3.3)
MCA coll: self (MCA v2.0, API v2.0, Component v1.3.3)
MCA coll: sm (MCA v2.0, API v2.0, Component v1.3.3)
MCA coll: sync (MCA v2.0, API v2.0, Component v1.3.3)
MCA coll: tuned (MCA v2.0, API v2.0, Component v1.3.3)
MCA io: romio (MCA v2.0, API v2.0, Component v1.3.3)
MCA mpool: fake (MCA v2.0, API v2.0, Component v1.3.3)
MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.3.3)
MCA mpool: sm (MCA v2.0, API v2.0, Component v1.3.3)
MCA pml: cm (MCA v2.0, API v2.0, Component v1.3.3)
MCA pml: csum (MCA v2.0, API v2.0, Component v1.3.3)
MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.3.3)
MCA pml: v (MCA v2.0, API v2.0, Component v1.3.3)
MCA bml: r2 (MCA v2.0, API v2.0, Component v1.3.3)
MCA rcache: vma (MCA v2.0, API v2.0, Component v1.3.3)
MCA btl: self (MCA v2.0, API v2.0, Component v1.3.3)
MCA btl: sm (MCA v2.0, API v2.0, Component v1.3.3)
MCA btl: tcp (MCA v2.0, API v2.0, Component v1.3.3)
MCA topo: unity (MCA v2.0, API v2.0, Component v1.3.3)
MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.3.3)
MCA osc: rdma (MCA v2.0, API v2.0, Component v1.3.3)
MCA iof: hnp (MCA v2.0, API v2.0, Component v1.3.3)
MCA iof: orted (MCA v2.0, API v2.0, Component v1.3.3)
MCA iof: tool (MCA v2.0, API v2.0, Component v1.3.3)
MCA oob: tcp (MCA v2.0, API v2.0, Component v1.3.3)
MCA odls: default (MCA v2.0, API v2.0, Component v1.3.3)
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)
MCA ras: slurm (MCA v2.0, API v2.0, Component v1.3.3)
MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.3.3)
MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.3.3)
MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.3.3)
MCA rml: oob (MCA v2.0, API v2.0, Component v1.3.3)
MCA routed: binomial (MCA v2.0, API v2.0, Component v1.3.3)
MCA routed: direct (MCA v2.0, API v2.0, Component v1.3.3)
MCA routed: linear (MCA v2.0, API v2.0, Component v1.3.3)
MCA plm: rsh (MCA v2.0, API v2.0, Component v1.3.3)
MCA plm: slurm (MCA v2.0, API v2.0, Component v1.3.3)
MCA filem: rsh (MCA v2.0, API v2.0, Component v1.3.3)
MCA errmgr: default (MCA v2.0, API v2.0, Component v1.3.3)
MCA ess: env (MCA v2.0, API v2.0, Component v1.3.3)
MCA ess: hnp (MCA v2.0, API v2.0, Component v1.3.3)
MCA ess: singleton (MCA v2.0, API v2.0, Component v1.3.3)
MCA ess: slurm (MCA v2.0, API v2.0, Component v1.3.3)
MCA ess: tool (MCA v2.0, API v2.0, Component v1.3.3)
MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.3.3)
MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.3.3)
Would upgrading to the latest version of Open MPI (1.4.3) resolve this issue?
Thank you,
-Bill Lane