Please help me resolve the following problem with a 306-node Rocks cluster running SGE. I'm submitting my jobs via the SGE command-line utility qsub. Please note that I can run the job successfully on fewer than 87 slots, but not on any more than that.
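For context, the submission itself looks roughly like this (the parallel environment name "orte", the slot count, and the script name are placeholders for what is actually configured here; our PE is what sets $NSLOTS and generates $TMPDIR/machines for the script below):

    # hypothetical submission: request the parallel environment with a slot
    # count above the ~86-slot threshold where the failures start
    qsub -pe orte 128 colors_job.sh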
The job script invokes mpirun with the following two lines:

    mpirun -n $NSLOTS -machinefile $TMPDIR/machines \
        --mca btl_tcp_if_include eth0 \
        --mca btl_tcp_if_exclude eth1 \
        --mca oob_tcp_if_exclude eth1 \
        --mca opal_set_max_sys_limits 1 \
        --mca pls_gridengine_verbose 1 \
        /stf/billstst/ProcessColors2MPICH1
    echo "MPICH1 mpirun returned $?"

eth1 is the connection to the Isilon NAS, where the object file is located.

The error messages returned are of the form:

    ORTE_ERROR_LOG: The system limit on number of pipes a process can open was reached
    ORTE_ERROR_LOG: The system limit on number of network connections a process can open was reached in file oob_tcp.c at line 447

We have increased the open-file limit from 1024 to 4096, but the problem still exists. I can run the same test code via MPICH2 successfully on all 696 slots of the cluster, but I can't run the same code (compiled against Open MPI version 1.3.3) on any more than 86 slots.
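To double-check that the raised limit is actually what the job processes inherit on the compute nodes (I'm not certain it propagates to everything sge_execd starts), the quick check I've been using is roughly the following; the script name is arbitrary:

    # contents of a hypothetical check_limits.sh
    #!/bin/bash
    #$ -cwd
    echo "host: $(hostname)"
    echo "soft nofile limit: $(ulimit -Sn)"
    echo "hard nofile limit: $(ulimit -Hn)"

    # submit it and read the limits from the job's output file
    qsub check_limits.sh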
Here are the details on the installed version of Open MPI:

    [root]# ./ompi_info
    Package: Open MPI r...@build-x86-64.rocksclusters.org Distribution
    Open MPI: 1.3.3
    Open MPI SVN revision: r21666
    Open MPI release date: Jul 14, 2009
    Open RTE: 1.3.3
    Open RTE SVN revision: r21666
    Open RTE release date: Jul 14, 2009
    OPAL: 1.3.3
    OPAL SVN revision: r21666
    OPAL release date: Jul 14, 2009
    Ident string: 1.3.3
    Prefix: /opt/openmpi
    Configured architecture: x86_64-unknown-linux-gnu
    Configure host: build-x86-64.rocksclusters.org
    Configured by: root
    Configured on: Sat Dec 12 16:29:23 PST 2009
    Configure host: build-x86-64.rocksclusters.org
    Built by: bruno
    Built on: Sat Dec 12 16:42:52 PST 2009
    Built host: build-x86-64.rocksclusters.org
    C bindings: yes
    C++ bindings: yes
    Fortran77 bindings: yes (all)
    Fortran90 bindings: yes
    Fortran90 bindings size: small
    C compiler: gcc
    C compiler absolute: /usr/bin/gcc
    C++ compiler: g++
    C++ compiler absolute: /usr/bin/g++
    Fortran77 compiler: gfortran
    Fortran77 compiler abs: /usr/bin/gfortran
    Fortran90 compiler: gfortran
    Fortran90 compiler abs: /usr/bin/gfortran
    C profiling: yes
    C++ profiling: yes
    Fortran77 profiling: yes
    Fortran90 profiling: yes
    C++ exceptions: no
    Thread support: posix (mpi: no, progress: no)
    Sparse Groups: no
    Internal debug support: no
    MPI parameter check: runtime
    Memory profiling support: no
    Memory debugging support: no
    libltdl support: yes
    Heterogeneous support: no
    mpirun default --prefix: no
    MPI I/O support: yes
    MPI_WTIME support: gettimeofday
    Symbol visibility support: yes
    FT Checkpoint support: no (checkpoint thread: no)
    MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.3.3)
    MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.3)
    MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3.3)
    MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.3.3)
    MCA carto: file (MCA v2.0, API v2.0, Component v1.3.3)
    MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.3.3)
    MCA timer: linux (MCA v2.0, API v2.0, Component v1.3.3)
    MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.3)
    MCA installdirs: config (MCA v2.0, API v2.0, Component v1.3.3)
    MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.3)
    MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.3)
    MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.3)
    MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.3.3)
    MCA coll: basic (MCA v2.0, API v2.0, Component v1.3.3)
    MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.3.3)
    MCA coll: inter (MCA v2.0, API v2.0, Component v1.3.3)
    MCA coll: self (MCA v2.0, API v2.0, Component v1.3.3)
    MCA coll: sm (MCA v2.0, API v2.0, Component v1.3.3)
    MCA coll: sync (MCA v2.0, API v2.0, Component v1.3.3)
    MCA coll: tuned (MCA v2.0, API v2.0, Component v1.3.3)
    MCA io: romio (MCA v2.0, API v2.0, Component v1.3.3)
    MCA mpool: fake (MCA v2.0, API v2.0, Component v1.3.3)
    MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.3.3)
    MCA mpool: sm (MCA v2.0, API v2.0, Component v1.3.3)
    MCA pml: cm (MCA v2.0, API v2.0, Component v1.3.3)
    MCA pml: csum (MCA v2.0, API v2.0, Component v1.3.3)
    MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.3.3)
    MCA pml: v (MCA v2.0, API v2.0, Component v1.3.3)
    MCA bml: r2 (MCA v2.0, API v2.0, Component v1.3.3)
    MCA rcache: vma (MCA v2.0, API v2.0, Component v1.3.3)
    MCA btl: self (MCA v2.0, API v2.0, Component v1.3.3)
    MCA btl: sm (MCA v2.0, API v2.0, Component v1.3.3)
    MCA btl: tcp (MCA v2.0, API v2.0, Component v1.3.3)
    MCA topo: unity (MCA v2.0, API v2.0, Component v1.3.3)
    MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.3.3)
    MCA osc: rdma (MCA v2.0, API v2.0, Component v1.3.3)
    MCA iof: hnp (MCA v2.0, API v2.0, Component v1.3.3)
    MCA iof: orted (MCA v2.0, API v2.0, Component v1.3.3)
    MCA iof: tool (MCA v2.0, API v2.0, Component v1.3.3)
    MCA oob: tcp (MCA v2.0, API v2.0, Component v1.3.3)
    MCA odls: default (MCA v2.0, API v2.0, Component v1.3.3)
    MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)
    MCA ras: slurm (MCA v2.0, API v2.0, Component v1.3.3)
    MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.3.3)
    MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.3.3)
    MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.3.3)
    MCA rml: oob (MCA v2.0, API v2.0, Component v1.3.3)
    MCA routed: binomial (MCA v2.0, API v2.0, Component v1.3.3)
    MCA routed: direct (MCA v2.0, API v2.0, Component v1.3.3)
    MCA routed: linear (MCA v2.0, API v2.0, Component v1.3.3)
    MCA plm: rsh (MCA v2.0, API v2.0, Component v1.3.3)
    MCA plm: slurm (MCA v2.0, API v2.0, Component v1.3.3)
    MCA filem: rsh (MCA v2.0, API v2.0, Component v1.3.3)
    MCA errmgr: default (MCA v2.0, API v2.0, Component v1.3.3)
    MCA ess: env (MCA v2.0, API v2.0, Component v1.3.3)
    MCA ess: hnp (MCA v2.0, API v2.0, Component v1.3.3)
    MCA ess: singleton (MCA v2.0, API v2.0, Component v1.3.3)
    MCA ess: slurm (MCA v2.0, API v2.0, Component v1.3.3)
    MCA ess: tool (MCA v2.0, API v2.0, Component v1.3.3)
    MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.3.3)
    MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.3.3)

Would upgrading to the latest version of Open MPI (1.4.3) resolve this issue?

Thank you,

-Bill Lane