On 26.07.2011 at 21:51, Ralph Castain wrote:

> 
> On Jul 26, 2011, at 1:39 PM, Reuti wrote:
> 
>> Hi,
>> 
>> On 26.07.2011 at 21:19, Lane, William wrote:
>> 
>>> I can successfully run the MPI test code via OpenMPI 1.3.3 on fewer than 87 
>>> slots with both the btl_tcp_if_exclude and btl_tcp_if_include switches
>>> passed to mpirun.
>>> 
>>> SGE always allocates the qsub jobs from the 24-slot nodes first -- up to 
>>> the 96 slots that these 4 nodes have available (on the largeMem.q). The 
>>> rest of the 602 slots are allocated from 2-slot nodes (all.q). All requests 
>>> of up to 96 slots are serviced by the largeMem.q nodes (which have 24 slots 
>>> apiece). Anything over 96 slots is serviced first by the largeMem.q nodes 
>>> and then by the all.q nodes.
>> 
>> Did you set up a JSV for it, since PEs have no sequence numbers when a PE is 
>> requested?
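>> 
>> For illustration, a minimal client-side JSV sketch (shell variant, using the 
>> stock $SGE_ROOT/util/resources/jsv/jsv_include.sh helper; the preference for 
>> largeMem.q is purely an assumption of mine -- adjust to your site) might 
>> look like:
>> 
>> #!/bin/sh
>> # Hypothetical JSV: when the "mpich" PE is requested, add a soft queue
>> # request so the 24-slot largeMem.q nodes are preferred over all.q.
>> . $SGE_ROOT/util/resources/jsv/jsv_include.sh
>> 
>> jsv_on_start()
>> {
>>    return
>> }
>> 
>> jsv_on_verify()
>> {
>>    if [ "`jsv_get_param pe_name`" = "mpich" ]; then
>>       jsv_set_param q_soft "largeMem.q"
>>       jsv_correct "soft queue request added for PE mpich"
>>    else
>>       jsv_accept "no change"
>>    fi
>>    return
>> }
>> 
>> jsv_main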
>> 
>> 
>>> Here's the PE that I'm using:
>>> 
>>> mpich PE (Parallel Environment) configuration:
>>> 
>>> pe_name            mpich
>>> slots              9999
>>> user_lists         NONE
>>> xuser_lists        NONE
>>> start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
>>> stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
>> 
>> Both of the above can be set to NONE when you compile Open MPI with SGE 
>> integration (--with-sge).
>> 
>> NB: what is defined in rsh_daemon/rsh_command in `qconf -sconf`?
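>> 
>> For tight integration these are typically either an rsh/ssh wrapper or, on 
>> SGE 6.2 and later, simply the builtin method. Purely as an illustration of 
>> what the output might look like:
>> 
>> $ qconf -sconf | egrep 'rsh_command|rsh_daemon'
>> rsh_command                  builtin
>> rsh_daemon                   builtin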
>> 
>> 
>>> allocation_rule    $fill_up
>> 
>> Here you specify to fill one machine completely before gathering slots from 
>> the next machine. You can change this to $round_robin to get one slot from 
>> each node before taking a second from any particular machine. If you prefer 
>> a fixed allocation, you can also put an integer here.
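>> 
>> For example (a sketch only -- which rule is right depends on how you want 
>> the slots spread), the rule can be changed with qconf:
>> 
>> $ qconf -mp mpich
>> allocation_rule    $round_robin
>> 
>> or, for a fixed number of slots per node:
>> 
>> allocation_rule    4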
> 
> Remember, OMPI only uses SGE to launch one daemon per node. The placement of 
> MPI procs is entirely up to mpirun itself, which doesn't look at any SGE 
> envar.

I thought this was the purpose of using --with-sge during configure: you don't 
have to provide any hostlist at all, and Open MPI will honor the allocation by 
reading SGE envars to get the granted slots?
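
(For reference, such a build would be configured along these lines -- the 
prefix matches the ompi_info output quoted below, the rest is a sketch:

  ./configure --prefix=/opt/openmpi --with-sge
  make all install

after which `ompi_info | grep gridengine` should list the gridengine ras 
component.)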

-- Reuti


>> 
>> 
>>> control_slaves     TRUE
>>> job_is_first_task  FALSE
>>> urgency_slots      min
>>> accounting_summary TRUE
>>> 
>>> Wouldn't the -bynode allocation be really inefficient? Does the -bynode 
>>> switch imply only one slot is used on each node before it moves on to the 
>>> next?
>> 
>> Do I get it right: within the slots granted by SGE, you want the allocation 
>> inside Open MPI to follow a specific pattern, i.e. which rank goes where?
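>> 
>> If you really need to pin specific ranks to specific hosts, Open MPI's 
>> rank_file mapper can do that. A purely hypothetical example (host names and 
>> binary name made up):
>> 
>> $ cat myrankfile
>> rank 0=node0001 slot=0
>> rank 1=node0002 slot=0
>> rank 2=node0001 slot=1
>> 
>> $ mpirun -np 3 -rf myrankfile ./mpi_testcode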
>> 
>> -- Reuti
>> 
>> 
>>> 
>>> Thanks for your help Ralph. At least I have some ideas on where to look now.
>>> 
>>> -Bill
>>> ________________________________________
>>> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of 
>>> Ralph Castain [r...@open-mpi.org]
>>> Sent: Tuesday, July 26, 2011 6:32 AM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] Can run OpenMPI testcode on 86 or fewer slots in 
>>> cluster, but nothing more than that
>>> 
>>> A few thoughts:
>>> 
>>> * including both btl_tcp_if_include and btl_tcp_if_exclude is problematic, 
>>> as they are mutually exclusive options. I'm not sure which one will take 
>>> precedence. I would suggest using only one of them.
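>>> 
>>> For example, keeping just the include side (interface name and binary taken 
>>> from your original command):
>>> 
>>>   mpirun -n $NSLOTS --mca btl_tcp_if_include eth0 /stf/billstst/ProcessColors2MPICH1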
>>> 
>>> * the default mapping algorithm is byslot - i.e., OMPI will place procs on 
>>> a node until all slots on that node have been filled, and then move to the 
>>> next node. Depending on what you have in your machinefile, it is possible 
>>> that all 88 procs are being placed on the first node. You might try 
>>> spreading your procs across all nodes with -bynode on the cmd line, or 
>>> check to ensure that the machinefile correctly specifies the number of 
>>> slots on each node. Note: OMPI will automatically read the SGE environment 
>>> to get the host allocation, so the only reason for providing a machinefile 
>>> is if you don't want the full allocation used.
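>>> 
>>> For instance (illustrative only; slot count and binary taken from your 
>>> script):
>>> 
>>>   mpirun -n $NSLOTS -bynode /stf/billstst/ProcessColors2MPICH1
>>> 
>>> would place ranks one per node in round-robin fashion instead of filling 
>>> each node first.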
>>> 
>>> * 88*88 = 7744. MPI transport connections are point-to-point - i.e., each 
>>> proc opens a unique connection to another proc. If your procs are all 
>>> winding up on the same node, for example, then the system will want at 
>>> least 7744 file descriptors on that node, assuming your application does a 
>>> complete wireup across all procs.
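>>> 
>>> As a quick sanity check (the numbers below are only examples), compare the 
>>> per-process descriptor limit on the compute nodes against that figure:
>>> 
>>>   $ ulimit -n
>>>   1024
>>> 
>>> and raise it if needed, e.g. via /etc/security/limits.conf:
>>> 
>>>   *   soft   nofile   8192
>>>   *   hard   nofile   16384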
>>> 
>>> Updating to 1.4.3 would be a good idea as it is more stable, but it may not 
>>> resolve this problem if the issue is one of the above.
>>> 
>>> HTH
>>> Ralph
>>> 
>>> 
>>> On Jul 25, 2011, at 11:23 PM, Lane, William wrote:
>>> 
>>>> Please help me resolve the following problems with a 306-node Rocks 
>>>> cluster using SGE. Please note I can run the
>>>> job successfully on <87 slots, but not on any more than that.
>>>> 
>>>> We're running SGE and I'm submitting my jobs via the SGE CLI utility qsub 
>>>> and the following lines from a script:
>>>> 
>>>> mpirun -n $NSLOTS  -machinefile $TMPDIR/machines --mca btl_tcp_if_include 
>>>> eth0 --mca btl_tcp_if_exclude eth1 --mca oob_tcp_if_exclude eth1 --mca 
>>>> opal_set_max_sys_limits 1 --mca pls_gridengine_verbose 1 
>>>> /stf/billstst/ProcessColors2MPICH1
>>>> echo "MPICH1 mpirun returned #?"
>>>> 
>>>> eth1 is the connection to the Isilon NAS, where the object file is  
>>>> located.
>>>> 
>>>> The error messages returned are of the form:
>>>> 
>>>> WRT ORTE_ERROR_LOG: The system limit on number of pipes a process can open 
>>>> was reached
>>>> WRT ORTE_ERROR_LOG: The system limit on number of network connections a 
>>>> process can open was reached in file oob_tcp.c at line 447
>>>> 
>>>> We have increased the open-file limit from 1024 to 4096, but the problem 
>>>> still exists.
>>>> 
>>>> I can run the same test code via MPICH2 successfully on all 696 slots of 
>>>> the cluster, but I can't run the
>>>> same code (compiled via OpenMPI version 1.3.3) on any more than 86 slots.
>>>> 
>>>> Here's the details on the installed version of Open MPI:
>>>> 
>>>> [root]# ./ompi_info
>>>>              Package: Open MPI r...@build-x86-64.rocksclusters.org 
>>>> Distribution
>>>>             Open MPI: 1.3.3
>>>> Open MPI SVN revision: r21666
>>>> Open MPI release date: Jul 14, 2009
>>>>             Open RTE: 1.3.3
>>>> Open RTE SVN revision: r21666
>>>> Open RTE release date: Jul 14, 2009
>>>>                 OPAL: 1.3.3
>>>>    OPAL SVN revision: r21666
>>>>    OPAL release date: Jul 14, 2009
>>>>         Ident string: 1.3.3
>>>>               Prefix: /opt/openmpi
>>>> Configured architecture: x86_64-unknown-linux-gnu
>>>>       Configure host: build-x86-64.rocksclusters.org
>>>>        Configured by: root
>>>>        Configured on: Sat Dec 12 16:29:23 PST 2009
>>>>       Configure host: build-x86-64.rocksclusters.org
>>>>             Built by: bruno
>>>>             Built on: Sat Dec 12 16:42:52 PST 2009
>>>>           Built host: build-x86-64.rocksclusters.org
>>>>           C bindings: yes
>>>>         C++ bindings: yes
>>>>   Fortran77 bindings: yes (all)
>>>>   Fortran90 bindings: yes
>>>> Fortran90 bindings size: small
>>>>           C compiler: gcc
>>>>  C compiler absolute: /usr/bin/gcc
>>>>         C++ compiler: g++
>>>> C++ compiler absolute: /usr/bin/g++
>>>>   Fortran77 compiler: gfortran
>>>> Fortran77 compiler abs: /usr/bin/gfortran
>>>>   Fortran90 compiler: gfortran
>>>> Fortran90 compiler abs: /usr/bin/gfortran
>>>>          C profiling: yes
>>>>        C++ profiling: yes
>>>>  Fortran77 profiling: yes
>>>>  Fortran90 profiling: yes
>>>>       C++ exceptions: no
>>>>       Thread support: posix (mpi: no, progress: no)
>>>>        Sparse Groups: no
>>>> Internal debug support: no
>>>>  MPI parameter check: runtime
>>>> Memory profiling support: no
>>>> Memory debugging support: no
>>>>      libltdl support: yes
>>>> Heterogeneous support: no
>>>> mpirun default --prefix: no
>>>>      MPI I/O support: yes
>>>>    MPI_WTIME support: gettimeofday
>>>> Symbol visibility support: yes
>>>> FT Checkpoint support: no  (checkpoint thread: no)
>>>>        MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.3.3)
>>>>           MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.3)
>>>>        MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3.3)
>>>>            MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.3.3)
>>>>            MCA carto: file (MCA v2.0, API v2.0, Component v1.3.3)
>>>>        MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.3.3)
>>>>            MCA timer: linux (MCA v2.0, API v2.0, Component v1.3.3)
>>>>      MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.3)
>>>>      MCA installdirs: config (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.3)
>>>>           MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.3)
>>>>        MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.3)
>>>>        MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.3.3)
>>>>             MCA coll: basic (MCA v2.0, API v2.0, Component v1.3.3)
>>>>             MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.3.3)
>>>>             MCA coll: inter (MCA v2.0, API v2.0, Component v1.3.3)
>>>>             MCA coll: self (MCA v2.0, API v2.0, Component v1.3.3)
>>>>             MCA coll: sm (MCA v2.0, API v2.0, Component v1.3.3)
>>>>             MCA coll: sync (MCA v2.0, API v2.0, Component v1.3.3)
>>>>             MCA coll: tuned (MCA v2.0, API v2.0, Component v1.3.3)
>>>>               MCA io: romio (MCA v2.0, API v2.0, Component v1.3.3)
>>>>            MCA mpool: fake (MCA v2.0, API v2.0, Component v1.3.3)
>>>>            MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.3.3)
>>>>            MCA mpool: sm (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA pml: cm (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA pml: csum (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA pml: v (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA bml: r2 (MCA v2.0, API v2.0, Component v1.3.3)
>>>>           MCA rcache: vma (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA btl: self (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA btl: sm (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA btl: tcp (MCA v2.0, API v2.0, Component v1.3.3)
>>>>             MCA topo: unity (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA osc: rdma (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA iof: hnp (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA iof: orted (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA iof: tool (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA oob: tcp (MCA v2.0, API v2.0, Component v1.3.3)
>>>>             MCA odls: default (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA ras: slurm (MCA v2.0, API v2.0, Component v1.3.3)
>>>>            MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.3.3)
>>>>            MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.3.3)
>>>>            MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA rml: oob (MCA v2.0, API v2.0, Component v1.3.3)
>>>>           MCA routed: binomial (MCA v2.0, API v2.0, Component v1.3.3)
>>>>           MCA routed: direct (MCA v2.0, API v2.0, Component v1.3.3)
>>>>           MCA routed: linear (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA plm: rsh (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA plm: slurm (MCA v2.0, API v2.0, Component v1.3.3)
>>>>            MCA filem: rsh (MCA v2.0, API v2.0, Component v1.3.3)
>>>>           MCA errmgr: default (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA ess: env (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA ess: hnp (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA ess: singleton (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA ess: slurm (MCA v2.0, API v2.0, Component v1.3.3)
>>>>              MCA ess: tool (MCA v2.0, API v2.0, Component v1.3.3)
>>>>          MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.3.3)
>>>>          MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.3.3)
>>>> 
>>>> Would upgrading to the latest version of OpenMPI (1.4.3) resolve this 
>>>> issue?
>>>> 
>>>> Thank you,
>>>> 
>>>> -Bill Lane
>>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 
> 

