Hi,

On 26.07.2011, at 21:19, Lane, William wrote:

> I can successfully run the MPI test code via OpenMPI 1.3.3 on fewer than 87 
> slots with both the btl_tcp_if_exclude and btl_tcp_if_include switches
> passed to mpirun.
> 
> SGE always allocates the qsub jobs from the 24 slot nodes first -- up to the 
> 96 slots that these 4 nodes have available (on the largeMem.q). The rest of 
> the 602 slots are allocated
> from 2 slot nodes (all.q). All requests of up to 96 slots are serviced by the 
> largeMem.q nodes (which have 24 slots apiece). Anything over 96 slots is 
> serviced first by the largeMem.q
> nodes then by the all.q nodes.

did you set up a JSV for it, since there are no sequence numbers for PEs in 
case a PE is requested?


> Here's the PE that I'm using:
> 
> mpich PE (Parallel Environment) queue:
> 
> pe_name            mpich
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
> stop_proc_args     /opt/gridengine/mpi/stopmpi.sh

Both of the above can be set to NONE when you compile Open MPI with SGE 
integration (--with-sge).

NB: what is defined in rsh_daemon/rsh_command in `qconf -sconf`?
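
For example, as a sketch of what I mean (your actual settings may of course 
differ, hence the question):

$ qconf -mp mpich                 # and set:
start_proc_args    NONE
stop_proc_args     NONE

$ qconf -sconf | egrep 'rsh_(command|daemon)'

The last command just lists the two relevant entries of the global 
configuration.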


> allocation_rule    $fill_up

Here you specify that one machine is filled up completely before slots are 
gathered from the next machine. You can change this to $round_robin to get one 
slot from each node before a second one is taken from any particular machine. 
If you prefer a fixed number of slots per host, you could also put an integer 
here.
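
For example, the two alternatives mentioned above would read (the fixed value 
is only an example):

allocation_rule    $round_robin

or:

allocation_rule    4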


> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary TRUE
> 
> Wouldn't the -bynode allocation be really inefficient? Does the -bynode 
> switch imply only one slot is used on each node before it moves on to the 
> next?

Do I get it right: within the slots granted by SGE, you want the allocation 
inside Open MPI to follow a specific pattern, i.e. which rank ends up where?
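
If so: inside the slots granted by SGE you can steer the mapping on the mpirun 
command line, e.g. (a sketch reusing the variable and path from your job 
script):

mpirun -np $NSLOTS -byslot /stf/billstst/ProcessColors2MPICH1   # default: fill each node's slots first
mpirun -np $NSLOTS -bynode /stf/billstst/ProcessColors2MPICH1   # one rank per node in turn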

-- Reuti


> 
> Thanks for your help Ralph. At least I have some ideas on where to look now.
> 
> -Bill
> ________________________________________
> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of 
> Ralph Castain [r...@open-mpi.org]
> Sent: Tuesday, July 26, 2011 6:32 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] Can run OpenMPI testcode on 86 or fewer slots in cluster, but nothing more than that
> 
> A few thoughts:
> 
> * including both btl_tcp_if_include and btl_tcp_if_exclude is problematic as 
> they are mutually exclusive options. I'm not sure which one will take 
> precedence. I would suggest only using one of them.
> 
> * the default mapping algorithm is byslot - i.e., OMPI will place procs on 
> each node of the cluster until all slots on that node have been filled, and 
> then moves to the next node. Depending on what you have in your machinefile, 
> it is possible that all 88 procs are being placed on the first node. You 
> might try spreading your procs across all nodes with -bynode on the cmd line, 
> or check to ensure that the machinefile is correctly specifying the number of 
> slots on each node. Note: OMPI will automatically read the SGE environment to 
> get the host allocation, so the only reason for providing a machinefile is if 
> you don't want the full allocation used.
> 
> * 88*88 = 7744. MPI transport connections are point-to-point - i.e., each 
> proc opens a unique connection to another proc. If your procs are all winding 
> up on the same node, for example, then the system will want at least 7744 
> file descriptors on that node, assuming your application does a complete 
> wireup across all procs.
> 
> Updating to 1.4.3 would be a good idea as it is more stable, but it may not 
> resolve this problem if the issue is one of the above.
> 
> HTH
> Ralph
> 
> 
> On Jul 25, 2011, at 11:23 PM, Lane, William wrote:
> 
>> Please help me resolve the following problems with a 306-node Rocks cluster 
>> using SGE. Please note I can run the
>> job successfully on <87 slots, but not any more than that.
>> 
>> We're running SGE and I'm submitting my jobs via the SGE CLI utility qsub 
>> and the following lines from a script:
>> 
>> mpirun -n $NSLOTS  -machinefile $TMPDIR/machines --mca btl_tcp_if_include 
>> eth0 --mca btl_tcp_if_exclude eth1 --mca oob_tcp_if_exclude eth1 --mca 
>> opal_set_max_sys_limits 1 --mca pls_gridengine_verbose 1 
>> /stf/billstst/ProcessColors2MPICH1
>> echo "MPICH1 mpirun returned #?"
>> 
>> eth1 is the connection to the Isilon NAS, where the object file is  located.
>> 
>> The error messages returned are of the form:
>> 
>> WRT ORTE_ERROR_LOG: The system limit on number of pipes a process can open 
>> was reached
>> WRT ORTE_ERROR_LOG: The system limit on number of network connections a 
>> process can open was reached in file oob_tcp.c at line 447
>> 
>> We have increased the open file limit from 1024 to 4096, but the problem 
>> still exists.
>> 
>> I can run the same test code via MPICH2 successfully on all 696 slots of the 
>> cluster, but I can't run the
>> same code (compiled via OpenMPI version 1.3.3) on any more than 86 slots.
>> 
>> Here's the details on the installed version of Open MPI:
>> 
>> [root]# ./ompi_info
>>                Package: Open MPI r...@build-x86-64.rocksclusters.org 
>> Distribution
>>               Open MPI: 1.3.3
>>  Open MPI SVN revision: r21666
>>  Open MPI release date: Jul 14, 2009
>>               Open RTE: 1.3.3
>>  Open RTE SVN revision: r21666
>>  Open RTE release date: Jul 14, 2009
>>                   OPAL: 1.3.3
>>      OPAL SVN revision: r21666
>>      OPAL release date: Jul 14, 2009
>>           Ident string: 1.3.3
>>                 Prefix: /opt/openmpi
>> Configured architecture: x86_64-unknown-linux-gnu
>>         Configure host: build-x86-64.rocksclusters.org
>>          Configured by: root
>>          Configured on: Sat Dec 12 16:29:23 PST 2009
>>         Configure host: build-x86-64.rocksclusters.org
>>               Built by: bruno
>>               Built on: Sat Dec 12 16:42:52 PST 2009
>>             Built host: build-x86-64.rocksclusters.org
>>             C bindings: yes
>>           C++ bindings: yes
>>     Fortran77 bindings: yes (all)
>>     Fortran90 bindings: yes
>> Fortran90 bindings size: small
>>             C compiler: gcc
>>    C compiler absolute: /usr/bin/gcc
>>           C++ compiler: g++
>>  C++ compiler absolute: /usr/bin/g++
>>     Fortran77 compiler: gfortran
>> Fortran77 compiler abs: /usr/bin/gfortran
>>     Fortran90 compiler: gfortran
>> Fortran90 compiler abs: /usr/bin/gfortran
>>            C profiling: yes
>>          C++ profiling: yes
>>    Fortran77 profiling: yes
>>    Fortran90 profiling: yes
>>         C++ exceptions: no
>>         Thread support: posix (mpi: no, progress: no)
>>          Sparse Groups: no
>> Internal debug support: no
>>    MPI parameter check: runtime
>> Memory profiling support: no
>> Memory debugging support: no
>>        libltdl support: yes
>>  Heterogeneous support: no
>> mpirun default --prefix: no
>>        MPI I/O support: yes
>>      MPI_WTIME support: gettimeofday
>> Symbol visibility support: yes
>>  FT Checkpoint support: no  (checkpoint thread: no)
>>          MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.3.3)
>>             MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.3)
>>          MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3.3)
>>              MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.3.3)
>>              MCA carto: file (MCA v2.0, API v2.0, Component v1.3.3)
>>          MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.3.3)
>>              MCA timer: linux (MCA v2.0, API v2.0, Component v1.3.3)
>>        MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.3)
>>        MCA installdirs: config (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.3)
>>             MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.3)
>>          MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.3)
>>          MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.3.3)
>>               MCA coll: basic (MCA v2.0, API v2.0, Component v1.3.3)
>>               MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.3.3)
>>               MCA coll: inter (MCA v2.0, API v2.0, Component v1.3.3)
>>               MCA coll: self (MCA v2.0, API v2.0, Component v1.3.3)
>>               MCA coll: sm (MCA v2.0, API v2.0, Component v1.3.3)
>>               MCA coll: sync (MCA v2.0, API v2.0, Component v1.3.3)
>>               MCA coll: tuned (MCA v2.0, API v2.0, Component v1.3.3)
>>                 MCA io: romio (MCA v2.0, API v2.0, Component v1.3.3)
>>              MCA mpool: fake (MCA v2.0, API v2.0, Component v1.3.3)
>>              MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.3.3)
>>              MCA mpool: sm (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA pml: cm (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA pml: csum (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA pml: v (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA bml: r2 (MCA v2.0, API v2.0, Component v1.3.3)
>>             MCA rcache: vma (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA btl: self (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA btl: sm (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA btl: tcp (MCA v2.0, API v2.0, Component v1.3.3)
>>               MCA topo: unity (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA osc: rdma (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA iof: hnp (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA iof: orted (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA iof: tool (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA oob: tcp (MCA v2.0, API v2.0, Component v1.3.3)
>>               MCA odls: default (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA ras: slurm (MCA v2.0, API v2.0, Component v1.3.3)
>>              MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.3.3)
>>              MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.3.3)
>>              MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA rml: oob (MCA v2.0, API v2.0, Component v1.3.3)
>>             MCA routed: binomial (MCA v2.0, API v2.0, Component v1.3.3)
>>             MCA routed: direct (MCA v2.0, API v2.0, Component v1.3.3)
>>             MCA routed: linear (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA plm: rsh (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA plm: slurm (MCA v2.0, API v2.0, Component v1.3.3)
>>              MCA filem: rsh (MCA v2.0, API v2.0, Component v1.3.3)
>>             MCA errmgr: default (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA ess: env (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA ess: hnp (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA ess: singleton (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA ess: slurm (MCA v2.0, API v2.0, Component v1.3.3)
>>                MCA ess: tool (MCA v2.0, API v2.0, Component v1.3.3)
>>            MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.3.3)
>>            MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.3.3)
>> 
>> Would upgrading to the latest version of OpenMPI (1.4.3) resolve this issue?
>> 
>> Thank you,
>> 
>> -Bill Lane

