Please help me resolve the following problems with a 306-node Rocks cluster
using SGE. Note that I can run the job successfully on fewer than 87 slots,
but not on any more than that.

We're running SGE, and I'm submitting my jobs via the SGE CLI utility qsub
with the following lines from a script:

mpirun -n $NSLOTS -machinefile $TMPDIR/machines \
    --mca btl_tcp_if_include eth0 --mca btl_tcp_if_exclude eth1 \
    --mca oob_tcp_if_exclude eth1 --mca opal_set_max_sys_limits 1 \
    --mca pls_gridengine_verbose 1 \
    /stf/billstst/ProcessColors2MPICH1
echo "MPICH1 mpirun returned $?"

eth1 is the connection to the Isilon NAS, where the executable is located.

The error messages returned are of the form:

WRT ORTE_ERROR_LOG: The system limit on number of pipes a process can open was reached
WRT ORTE_ERROR_LOG: The system limit on number of network connections a process can open was reached in file oob_tcp.c at line 447

We have increased the open file limit from 1024 to 4096, but the problem still exists.
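
For reference, this is the kind of check I can run to confirm what limit
processes actually inherit when launched through SGE, since limits raised in
an interactive shell or in limits.conf don't always propagate to processes
spawned by sge_execd (the job name here is illustrative):

# Submit a trivial job that reports the limits seen inside an SGE job:
echo 'ulimit -n; ulimit -u; cat /proc/sys/fs/file-max' | qsub -cwd -N limitcheck

# Report the per-node limit that mpirun-launched processes inherit,
# reusing the same machinefile as the real job:
mpirun -n $NSLOTS -machinefile $TMPDIR/machines \
    bash -c 'echo "$(hostname): nofile=$(ulimit -n)"'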

I can run the same test code via MPICH2 successfully on all 696 slots of the
cluster, but I can't run the same code (compiled against Open MPI version
1.3.3) on any more than 86 slots.

Here are the details on the installed version of Open MPI:

[root]# ./ompi_info
                 Package: Open MPI r...@build-x86-64.rocksclusters.org Distribution
                Open MPI: 1.3.3
   Open MPI SVN revision: r21666
   Open MPI release date: Jul 14, 2009
                Open RTE: 1.3.3
   Open RTE SVN revision: r21666
   Open RTE release date: Jul 14, 2009
                    OPAL: 1.3.3
       OPAL SVN revision: r21666
       OPAL release date: Jul 14, 2009
            Ident string: 1.3.3
                  Prefix: /opt/openmpi
 Configured architecture: x86_64-unknown-linux-gnu
          Configure host: build-x86-64.rocksclusters.org
           Configured by: root
           Configured on: Sat Dec 12 16:29:23 PST 2009
          Configure host: build-x86-64.rocksclusters.org
                Built by: bruno
                Built on: Sat Dec 12 16:42:52 PST 2009
              Built host: build-x86-64.rocksclusters.org
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (all)
      Fortran90 bindings: yes
 Fortran90 bindings size: small
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
      Fortran77 compiler: gfortran
  Fortran77 compiler abs: /usr/bin/gfortran
      Fortran90 compiler: gfortran
  Fortran90 compiler abs: /usr/bin/gfortran
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: yes
          C++ exceptions: no
          Thread support: posix (mpi: no, progress: no)
           Sparse Groups: no
  Internal debug support: no
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: yes
   Heterogeneous support: no
 mpirun default --prefix: no
         MPI I/O support: yes
       MPI_WTIME support: gettimeofday
Symbol visibility support: yes
   FT Checkpoint support: no  (checkpoint thread: no)
           MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.3.3)
              MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.3)
           MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3.3)
               MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.3.3)
               MCA carto: file (MCA v2.0, API v2.0, Component v1.3.3)
           MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.3.3)
               MCA timer: linux (MCA v2.0, API v2.0, Component v1.3.3)
         MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.3)
         MCA installdirs: config (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.3)
              MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.3)
           MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.3)
           MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.3.3)
                MCA coll: basic (MCA v2.0, API v2.0, Component v1.3.3)
                MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.3.3)
                MCA coll: inter (MCA v2.0, API v2.0, Component v1.3.3)
                MCA coll: self (MCA v2.0, API v2.0, Component v1.3.3)
                MCA coll: sm (MCA v2.0, API v2.0, Component v1.3.3)
                MCA coll: sync (MCA v2.0, API v2.0, Component v1.3.3)
                MCA coll: tuned (MCA v2.0, API v2.0, Component v1.3.3)
                  MCA io: romio (MCA v2.0, API v2.0, Component v1.3.3)
               MCA mpool: fake (MCA v2.0, API v2.0, Component v1.3.3)
               MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.3.3)
               MCA mpool: sm (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA pml: cm (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA pml: csum (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA pml: v (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA bml: r2 (MCA v2.0, API v2.0, Component v1.3.3)
              MCA rcache: vma (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA btl: self (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA btl: sm (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA btl: tcp (MCA v2.0, API v2.0, Component v1.3.3)
                MCA topo: unity (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA osc: rdma (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA iof: hnp (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA iof: orted (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA iof: tool (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA oob: tcp (MCA v2.0, API v2.0, Component v1.3.3)
                MCA odls: default (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA ras: slurm (MCA v2.0, API v2.0, Component v1.3.3)
               MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.3.3)
               MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.3.3)
               MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA rml: oob (MCA v2.0, API v2.0, Component v1.3.3)
              MCA routed: binomial (MCA v2.0, API v2.0, Component v1.3.3)
              MCA routed: direct (MCA v2.0, API v2.0, Component v1.3.3)
              MCA routed: linear (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA plm: rsh (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA plm: slurm (MCA v2.0, API v2.0, Component v1.3.3)
               MCA filem: rsh (MCA v2.0, API v2.0, Component v1.3.3)
              MCA errmgr: default (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA ess: env (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA ess: hnp (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA ess: singleton (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA ess: slurm (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA ess: tool (MCA v2.0, API v2.0, Component v1.3.3)
             MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.3.3)
             MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.3.3)

Would upgrading to the latest version of Open MPI (1.4.3) resolve this issue?

Thank you,

-Bill Lane
