On Jul 26, 2011, at 1:39 PM, Reuti wrote:

> Hi,
> 
> Am 26.07.2011 um 21:19 schrieb Lane, William:
> 
>> I can successfully run the MPI test code via Open MPI 1.3.3 on fewer than 87
>> slots with both the btl_tcp_if_exclude and btl_tcp_if_include switches
>> passed to mpirun.
>> 
>> SGE always allocates the qsub jobs from the 24-slot nodes first -- up to the
>> 96 slots that these 4 nodes have available (on the largeMem.q). The rest of
>> the 602 slots are allocated from 2-slot nodes (all.q). All requests of up to
>> 96 slots are serviced by the largeMem.q nodes (which have 24 slots apiece).
>> Anything over 96 slots is serviced first by the largeMem.q nodes, then by
>> the all.q nodes.
> 
> did you set up a JSV for it, as sequence numbers are not taken into account
> when a PE is requested?
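
For reference, a minimal client-side JSV along those lines might look like the
sketch below. The include path is the stock one shipped with SGE; the
queue-steering policy (prefer largeMem.q when the mpich PE is requested) is
only an assumed example, not something stated in this thread. It would be
activated per submission with "qsub -jsv <path>" or globally via sge_request.

#!/bin/sh
# Pull in the JSV helper functions shipped with SGE
. $SGE_ROOT/util/resources/jsv/jsv_include.sh

jsv_on_start()
{
   return
}

jsv_on_verify()
{
   # Example policy: if the mpich PE was requested, add a soft queue
   # request so the 24-slot largeMem.q nodes are preferred.
   if [ "`jsv_get_param pe_name`" = "mpich" ]; then
      jsv_set_param q_soft "largeMem.q"
      jsv_correct "PE job steered towards largeMem.q"
   else
      jsv_accept ""
   fi
   return
}

jsv_main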
> 
> 
>> Here's the PE that I'm using:
>> 
>> mpich PE (Parallel Environment) configuration:
>> 
>> pe_name            mpich
>> slots              9999
>> user_lists         NONE
>> xuser_lists        NONE
>> start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
>> stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
> 
> Both of the above can be set to NONE when Open MPI was compiled with SGE
> integration (--with-sge).
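
With a tight SGE integration those two PE lines would then shrink to something
like the following sketch (everything else in the PE stays as listed above),
and whether this particular Open MPI build has the gridengine support can be
checked as shown -- the grep line matches what appears in the ompi_info output
further down:

start_proc_args    NONE
stop_proc_args     NONE

$ ompi_info | grep gridengine
              MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)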
> 
> NB: what is defined in rsh_daemon/rsh_command in `qconf -sconf`?
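
For comparison, on a typical ssh-based tight integration the relevant part of
`qconf -sconf` looks roughly like this; the values here are only illustrative
(newer Grid Engine releases may show "builtin" for both instead), what matters
is what this cluster actually has:

$ qconf -sconf | egrep 'rsh_command|rsh_daemon'
rsh_command                  /usr/bin/ssh
rsh_daemon                   /usr/sbin/sshd -i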
> 
> 
>> allocation_rule    $fill_up
> 
> Here you specify to fill one machine completely before gathering slots from
> the next machine. You can change this to $round_robin to get one slot from
> each node before taking a second slot from any particular machine. If you
> prefer a fixed allocation, you could also put an integer here.
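
Illustratively, the three variants side by side (the trailing comments are just
annotations, not part of the PE syntax):

allocation_rule    $fill_up       # pack one machine full before using the next
allocation_rule    $round_robin   # one slot per machine per pass over the allocation
allocation_rule    24             # take exactly 24 slots from every machine used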

Remember, OMPI only uses SGE to launch one daemon per node. The placement of MPI
procs is entirely up to mpirun itself, which doesn't look at any SGE environment
variable.
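
As a quick sketch of what that means on the command line in the 1.3/1.4 series
(the binary name here is just a placeholder):

# default mapping: fill every slot on a node before moving to the next one
mpirun -np $NSLOTS -byslot ./a.out

# round-robin mapping: one proc per node per pass over the allocated nodes
mpirun -np $NSLOTS -bynode ./a.out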

> 
> 
>> control_slaves     TRUE
>> job_is_first_task  FALSE
>> urgency_slots      min
>> accounting_summary TRUE
>> 
>> Wouldn't the -bynode allocation be really inefficient? Does the -bynode 
>> switch imply only one slot is used on each node before it moves on to the 
>> next?
> 
> Do I understand correctly: within the slots granted by SGE, you want the
> allocation inside Open MPI to follow a specific pattern, i.e. which rank goes
> where?
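
If pinning specific ranks to specific hosts really is the goal, the rank_file
mapper listed in the ompi_info output further down can do that. A hypothetical
example (host names and the binary name are invented for illustration):

$ cat myrankfile
rank 0=node01 slot=0
rank 1=node01 slot=1
rank 2=node02 slot=0
$ mpirun -np 3 -rf myrankfile ./a.out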
> 
> -- Reuti
> 
> 
>> 
>> Thanks for your help Ralph. At least I have some ideas on where to look now.
>> 
>> -Bill
>> ________________________________________
>> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of 
>> Ralph Castain [r...@open-mpi.org]
>> Sent: Tuesday, July 26, 2011 6:32 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Can run OpenMPI testcode on 86 or fewer slots in
>> cluster, but nothing more than that
>> 
>> A few thoughts:
>> 
>> * including both btl_tcp_if_include and btl_tcp_if_exclude is problematic as 
>> they are mutually exclusive options. I'm not sure which one will take 
>> precedence. I would suggest only using one of them.
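
For example, keeping just the include form for the MPI traffic (interface names
taken from the command line quoted below, the rest of the options unchanged):

mpirun -n $NSLOTS --mca btl_tcp_if_include eth0 \
       --mca oob_tcp_if_exclude eth1 ...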
>> 
>> * the default mapping algorithm is byslot - i.e., OMPI will place procs on a
>> node until all slots on that node have been filled, and then move to the
>> next node. Depending on what you have in your machinefile, it is possible
>> that all 88 procs are being placed on the first node. You might try
>> spreading your procs across all nodes with -bynode on the cmd line, or check
>> to ensure that the machinefile correctly specifies the number of slots on
>> each node. Note: OMPI will automatically read the SGE environment to get the
>> host allocation, so the only reason for providing a machinefile is if you
>> don't want the full allocation used.
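
A sketch of that distinction (host names and slot counts are invented; the
SGE-generated $TMPDIR/machines may instead simply repeat each host name once
per slot, which OMPI also understands):

$ cat machines
node01 slots=24
node02 slots=24
node03 slots=2

# byslot (the default): ranks 0-23 land on node01, 24-47 on node02, then node03
# bynode: rank 0 on node01, rank 1 on node02, rank 2 on node03, rank 3 on node01, ...
$ mpirun -np 50 -machinefile machines -bynode ./a.out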
>> 
>> * 88*88 = 7744. MPI transport connections are point-to-point - i.e., each 
>> proc opens a unique connection to another proc. If your procs are all 
>> winding up on the same node, for example, then the system will want at least 
>> 7744 file descriptors on that node, assuming your application does a 
>> complete wireup across all procs.
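
A rough back-of-the-envelope check of that (the commands are the standard Linux
ones, nothing specific to this cluster):

# one socket per peer, per proc: a node hosting all 88 procs would need on
# the order of 88 * 88 ~ 7700 descriptors for the TCP BTL alone, before
# pipes to the daemon and stdio are even counted
ulimit -n                     # per-process fd limit the procs inherit
cat /proc/sys/fs/file-max     # system-wide limit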
>> 
>> Updating to 1.4.3 would be a good idea as it is more stable, but it may not 
>> resolve this problem if the issue is one of the above.
>> 
>> HTH
>> Ralph
>> 
>> 
>> On Jul 25, 2011, at 11:23 PM, Lane, William wrote:
>> 
>>> Please help me resolve the following problems with a 306-node Rocks cluster
>>> using SGE. Please note I can run the job successfully on fewer than 87
>>> slots, but not on any more than that.
>>> 
>>> We're running SGE and I'm submitting my jobs via the SGE CLI utility qsub 
>>> and the following lines from a script:
>>> 
>>> mpirun -n $NSLOTS  -machinefile $TMPDIR/machines --mca btl_tcp_if_include 
>>> eth0 --mca btl_tcp_if_exclude eth1 --mca oob_tcp_if_exclude eth1 --mca 
>>> opal_set_max_sys_limits 1 --mca pls_gridengine_verbose 1 
>>> /stf/billstst/ProcessColors2MPICH1
>>> echo "MPICH1 mpirun returned $?"
>>> 
>>> eth1 is the connection to the Isilon NAS, where the object file is located.
>>> 
>>> The error messages returned are of the form:
>>> 
>>> WRT ORTE_ERROR_LOG: The system limit on number of pipes a process can open 
>>> was reached
>>> WRT ORTE_ERROR_LOG: The system limit on number of network connections a 
>>> process can open was reached in file oob_tcp.c at line 447
>>> 
>>> We have increased the open file limit from 1024 to 4096, but the problem
>>> still exists.
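
A sketch of how the limit could be checked and raised further; paths and values
are the usual Linux ones, not verified on this cluster, and the new limit has
to be inherited by sge_execd (and hence by the orted daemons it starts), which
usually means restarting the execd after the change:

$ ulimit -Sn ; ulimit -Hn          # current soft and hard per-process limits

# /etc/security/limits.conf on the compute nodes, e.g.:
*    soft    nofile    16384
*    hard    nofile    16384

Note also that --mca opal_set_max_sys_limits 1 (already on the mpirun line
above) should only be able to raise the soft limit as far as the hard limit the
daemons inherited.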
>>> 
>>> I can run the same test code via MPICH2 successfully on all 696 slots of 
>>> the cluster, but I can't run the
>>> same code (compiled via OpenMPI version 1.3.3) on any more than 86 slots.
>>> 
>>> Here's the details on the installed version of Open MPI:
>>> 
>>> [root]# ./ompi_info
>>>               Package: Open MPI r...@build-x86-64.rocksclusters.org 
>>> Distribution
>>>              Open MPI: 1.3.3
>>> Open MPI SVN revision: r21666
>>> Open MPI release date: Jul 14, 2009
>>>              Open RTE: 1.3.3
>>> Open RTE SVN revision: r21666
>>> Open RTE release date: Jul 14, 2009
>>>                  OPAL: 1.3.3
>>>     OPAL SVN revision: r21666
>>>     OPAL release date: Jul 14, 2009
>>>          Ident string: 1.3.3
>>>                Prefix: /opt/openmpi
>>> Configured architecture: x86_64-unknown-linux-gnu
>>>        Configure host: build-x86-64.rocksclusters.org
>>>         Configured by: root
>>>         Configured on: Sat Dec 12 16:29:23 PST 2009
>>>        Configure host: build-x86-64.rocksclusters.org
>>>              Built by: bruno
>>>              Built on: Sat Dec 12 16:42:52 PST 2009
>>>            Built host: build-x86-64.rocksclusters.org
>>>            C bindings: yes
>>>          C++ bindings: yes
>>>    Fortran77 bindings: yes (all)
>>>    Fortran90 bindings: yes
>>> Fortran90 bindings size: small
>>>            C compiler: gcc
>>>   C compiler absolute: /usr/bin/gcc
>>>          C++ compiler: g++
>>> C++ compiler absolute: /usr/bin/g++
>>>    Fortran77 compiler: gfortran
>>> Fortran77 compiler abs: /usr/bin/gfortran
>>>    Fortran90 compiler: gfortran
>>> Fortran90 compiler abs: /usr/bin/gfortran
>>>           C profiling: yes
>>>         C++ profiling: yes
>>>   Fortran77 profiling: yes
>>>   Fortran90 profiling: yes
>>>        C++ exceptions: no
>>>        Thread support: posix (mpi: no, progress: no)
>>>         Sparse Groups: no
>>> Internal debug support: no
>>>   MPI parameter check: runtime
>>> Memory profiling support: no
>>> Memory debugging support: no
>>>       libltdl support: yes
>>> Heterogeneous support: no
>>> mpirun default --prefix: no
>>>       MPI I/O support: yes
>>>     MPI_WTIME support: gettimeofday
>>> Symbol visibility support: yes
>>> FT Checkpoint support: no  (checkpoint thread: no)
>>>         MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.3.3)
>>>            MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.3)
>>>         MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3.3)
>>>             MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.3.3)
>>>             MCA carto: file (MCA v2.0, API v2.0, Component v1.3.3)
>>>         MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.3.3)
>>>             MCA timer: linux (MCA v2.0, API v2.0, Component v1.3.3)
>>>       MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.3)
>>>       MCA installdirs: config (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.3)
>>>            MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.3)
>>>         MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.3)
>>>         MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.3.3)
>>>              MCA coll: basic (MCA v2.0, API v2.0, Component v1.3.3)
>>>              MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.3.3)
>>>              MCA coll: inter (MCA v2.0, API v2.0, Component v1.3.3)
>>>              MCA coll: self (MCA v2.0, API v2.0, Component v1.3.3)
>>>              MCA coll: sm (MCA v2.0, API v2.0, Component v1.3.3)
>>>              MCA coll: sync (MCA v2.0, API v2.0, Component v1.3.3)
>>>              MCA coll: tuned (MCA v2.0, API v2.0, Component v1.3.3)
>>>                MCA io: romio (MCA v2.0, API v2.0, Component v1.3.3)
>>>             MCA mpool: fake (MCA v2.0, API v2.0, Component v1.3.3)
>>>             MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.3.3)
>>>             MCA mpool: sm (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA pml: cm (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA pml: csum (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA pml: v (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA bml: r2 (MCA v2.0, API v2.0, Component v1.3.3)
>>>            MCA rcache: vma (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA btl: self (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA btl: sm (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA btl: tcp (MCA v2.0, API v2.0, Component v1.3.3)
>>>              MCA topo: unity (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA osc: rdma (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA iof: hnp (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA iof: orted (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA iof: tool (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA oob: tcp (MCA v2.0, API v2.0, Component v1.3.3)
>>>              MCA odls: default (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA ras: slurm (MCA v2.0, API v2.0, Component v1.3.3)
>>>             MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.3.3)
>>>             MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.3.3)
>>>             MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA rml: oob (MCA v2.0, API v2.0, Component v1.3.3)
>>>            MCA routed: binomial (MCA v2.0, API v2.0, Component v1.3.3)
>>>            MCA routed: direct (MCA v2.0, API v2.0, Component v1.3.3)
>>>            MCA routed: linear (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA plm: rsh (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA plm: slurm (MCA v2.0, API v2.0, Component v1.3.3)
>>>             MCA filem: rsh (MCA v2.0, API v2.0, Component v1.3.3)
>>>            MCA errmgr: default (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA ess: env (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA ess: hnp (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA ess: singleton (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA ess: slurm (MCA v2.0, API v2.0, Component v1.3.3)
>>>               MCA ess: tool (MCA v2.0, API v2.0, Component v1.3.3)
>>>           MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.3.3)
>>>           MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.3.3)
>>> 
>>> Would upgrading to the latest version of OpenMPI (1.4.3) resolve this issue?
>>> 
>>> Thank you,
>>> 
>>> -Bill Lane