On Jul 26, 2011, at 1:39 PM, Reuti wrote:

> Hi,
>
> On 26.07.2011, at 21:19, Lane, William wrote:
>
>> I can successfully run the MPI test code via OpenMPI 1.3.3 on fewer than 87 slots with both the btl_tcp_if_exclude and btl_tcp_if_include switches passed to mpirun.
>>
>> SGE always allocates the qsub jobs from the 24-slot nodes first -- up to the 96 slots that these 4 nodes have available (on the largeMem.q). The rest of the 602 slots are allocated from 2-slot nodes (all.q). All requests of up to 96 slots are serviced by the largeMem.q nodes (which have 24 slots apiece). Anything over 96 slots is serviced first by the largeMem.q nodes, then by the all.q nodes.
>
> Did you set up a JSV for it, as PEs have no sequence numbers in case a PE is requested?
>
>> Here's the PE that I'm using:
>>
>> mpich PE (Parallel Environment) queue:
>>
>> pe_name            mpich
>> slots              9999
>> user_lists         NONE
>> xuser_lists        NONE
>> start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
>> stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
>
> Both of the above can be set to NONE when Open MPI was compiled with SGE integration (--with-sge).
>
> NB: what is defined in rsh_daemon/rsh_command in `qconf -sconf`?
>
>> allocation_rule    $fill_up
>
> Here you specify to fill one machine completely before gathering slots from the next machine. You can change this to $round_robin to get one slot from each node before taking a second from particular machines. If you prefer a fixed allocation, you could also put an integer here.
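For reference, with a tight SGE integration the PE above could be reduced to something like the sketch below. This is only a sketch: it assumes Open MPI was built with --with-sge, and the PE name, slot count and allocation_rule are simply carried over from the original setup:

    $ qconf -sp mpich
    pe_name            mpich
    slots              9999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    NONE
    stop_proc_args     NONE
    allocation_rule    $fill_up
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary TRUE

With start_proc_args/stop_proc_args set to NONE, mpirun picks up the granted host list from SGE itself and starts its daemons via qrsh (control_slaves TRUE is what allows that), so no startmpi.sh/stopmpi.sh wrapper or machinefile is needed.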
Remember, OMPI only uses SGE to launch one daemon per node. The placement of MPI procs is totally up to mpirun itself, which doesn't look at any SGE envar.

>> control_slaves     TRUE
>> job_is_first_task  FALSE
>> urgency_slots      min
>> accounting_summary TRUE
>>
>> Wouldn't the -bynode allocation be really inefficient? Does the -bynode switch imply only one slot is used on each node before it moves on to the next?
>
> Do I get it right: inside the slots granted by SGE, you want the allocation inside Open MPI to follow a specific pattern, i.e. which rank is where?
>
> -- Reuti
>
>> Thanks for your help Ralph. At least I have some ideas on where to look now.
>>
>> -Bill
>> ________________________________________
>> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>> Sent: Tuesday, July 26, 2011 6:32 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Can run OpenMPI testcode on 86 or fewer slots in cluster, but nothing more than that
>>
>> A few thoughts:
>>
>> * Including both btl_tcp_if_include and btl_tcp_if_exclude is problematic, as they are mutually exclusive options. I'm not sure which one will take precedence. I would suggest using only one of them.
>>
>> * The default mapping algorithm is byslot -- i.e., OMPI will place procs on each node of the cluster until all slots on that node have been filled, and then move to the next node. Depending on what you have in your machinefile, it is possible that all 88 procs are being placed on the first node. You might try spreading your procs across all nodes with -bynode on the cmd line, or check to ensure that the machinefile correctly specifies the number of slots on each node. Note: OMPI will automatically read the SGE environment to get the host allocation, so the only reason for providing a machinefile is if you don't want the full allocation used.
>>
>> * 88*88 = 7744. MPI transport connections are point-to-point -- i.e., each proc opens a unique connection to another proc. If your procs are all winding up on the same node, for example, then the system will want at least 7744 file descriptors on that node, assuming your application does a complete wireup across all procs.
>>
>> Updating to 1.4.3 would be a good idea as it is more stable, but it may not resolve this problem if the issue is one of the above.
>>
>> HTH
>> Ralph
>>
>>
>> On Jul 25, 2011, at 11:23 PM, Lane, William wrote:
>>
>>> Please help me resolve the following problems with a 306-node Rocks cluster using SGE. Please note I can run the job successfully on <87 slots, but not any more than that.
>>>
>>> We're running SGE and I'm submitting my jobs via the SGE CLI utility qsub and the following lines from a script:
>>>
>>> mpirun -n $NSLOTS -machinefile $TMPDIR/machines --mca btl_tcp_if_include eth0 --mca btl_tcp_if_exclude eth1 --mca oob_tcp_if_exclude eth1 --mca opal_set_max_sys_limits 1 --mca pls_gridengine_verbose 1 /stf/billstst/ProcessColors2MPICH1
>>> echo "MPICH1 mpirun returned #?"
>>>
>>> eth1 is the connection to the Isilon NAS, where the object file is located.
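As an aside, once only one of the btl_tcp include/exclude options is kept and the machinefile is dropped (mpirun reads the SGE allocation on its own, as noted above), the submit script could be reduced to something like the following sketch. The -bynode flag and the requested slot count are only illustrations; the interface names, the remaining MCA settings and the binary path are taken from the original script:

    #!/bin/bash
    #$ -pe mpich 128          # slot count is just a placeholder
    #$ -cwd
    # No -machinefile: with SGE support, mpirun reads the granted host list itself.
    # Keep only one TCP interface option for the btl (include OR exclude, not both).
    mpirun -n $NSLOTS -bynode \
           --mca btl_tcp_if_include eth0 \
           --mca oob_tcp_if_exclude eth1 \
           --mca opal_set_max_sys_limits 1 \
           /stf/billstst/ProcessColors2MPICH1
    echo "mpirun exited with status $?"

Whether -bynode or the default byslot mapping is the better choice depends on the application's communication pattern; the point here is only that the procs end up spread across the granted nodes instead of piling onto the first one.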
>>>
>>> The error messages returned are of the form:
>>>
>>> WRT ORTE_ERROR_LOG: The system limit on number of pipes a process can open was reached
>>> WRT ORTE_ERROR_LOG: The system limit on number of network connections a process can open was reached in file oob_tcp.c at line 447
>>>
>>> We have increased the open file limit from 1024 to 4096, but the problem still exists.
>>>
>>> I can run the same test code via MPICH2 successfully on all 696 slots of the cluster, but I can't run the same code (compiled via OpenMPI version 1.3.3) on any more than 86 slots.
>>>
>>> Here are the details on the installed version of Open MPI:
>>>
>>> [root]# ./ompi_info
>>> Package: Open MPI r...@build-x86-64.rocksclusters.org Distribution
>>> Open MPI: 1.3.3
>>> Open MPI SVN revision: r21666
>>> Open MPI release date: Jul 14, 2009
>>> Open RTE: 1.3.3
>>> Open RTE SVN revision: r21666
>>> Open RTE release date: Jul 14, 2009
>>> OPAL: 1.3.3
>>> OPAL SVN revision: r21666
>>> OPAL release date: Jul 14, 2009
>>> Ident string: 1.3.3
>>> Prefix: /opt/openmpi
>>> Configured architecture: x86_64-unknown-linux-gnu
>>> Configure host: build-x86-64.rocksclusters.org
>>> Configured by: root
>>> Configured on: Sat Dec 12 16:29:23 PST 2009
>>> Configure host: build-x86-64.rocksclusters.org
>>> Built by: bruno
>>> Built on: Sat Dec 12 16:42:52 PST 2009
>>> Built host: build-x86-64.rocksclusters.org
>>> C bindings: yes
>>> C++ bindings: yes
>>> Fortran77 bindings: yes (all)
>>> Fortran90 bindings: yes
>>> Fortran90 bindings size: small
>>> C compiler: gcc
>>> C compiler absolute: /usr/bin/gcc
>>> C++ compiler: g++
>>> C++ compiler absolute: /usr/bin/g++
>>> Fortran77 compiler: gfortran
>>> Fortran77 compiler abs: /usr/bin/gfortran
>>> Fortran90 compiler: gfortran
>>> Fortran90 compiler abs: /usr/bin/gfortran
>>> C profiling: yes
>>> C++ profiling: yes
>>> Fortran77 profiling: yes
>>> Fortran90 profiling: yes
>>> C++ exceptions: no
>>> Thread support: posix (mpi: no, progress: no)
>>> Sparse Groups: no
>>> Internal debug support: no
>>> MPI parameter check: runtime
>>> Memory profiling support: no
>>> Memory debugging support: no
>>> libltdl support: yes
>>> Heterogeneous support: no
>>> mpirun default --prefix: no
>>> MPI I/O support: yes
>>> MPI_WTIME support: gettimeofday
>>> Symbol visibility support: yes
>>> FT Checkpoint support: no (checkpoint thread: no)
>>> MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA carto: file (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA timer: linux (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA installdirs: config (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA coll: basic (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA coll: inter (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA coll: self (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA coll: sm (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA coll: sync (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA coll: tuned (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA io: romio (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA mpool: fake (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA mpool: sm (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA pml: cm (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA pml: csum (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA pml: v (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA bml: r2 (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA rcache: vma (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA btl: self (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA btl: sm (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA btl: tcp (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA topo: unity (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA osc: rdma (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA iof: hnp (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA iof: orted (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA iof: tool (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA oob: tcp (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA odls: default (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA ras: slurm (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA rml: oob (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA routed: binomial (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA routed: direct (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA routed: linear (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA plm: rsh (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA plm: slurm (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA filem: rsh (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA errmgr: default (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA ess: env (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA ess: hnp (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA ess: singleton (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA ess: slurm (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA ess: tool (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.3.3)
>>> MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.3.3)
>>>
>>> Would upgrading to the latest version of OpenMPI (1.4.3) resolve this issue?
>>>
>>> Thank you,
>>>
>>> -Bill Lane
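One more thing worth checking, given the error messages above: Ralph's 88*88 estimate comes to roughly 7744 descriptors if a complete wire-up lands on a single node, which is still well above the 4096 limit mentioned earlier. A quick way to see what limits batch jobs actually inherit on the execution hosts is a tiny test job like the sketch below (the script name, output file and example values are only placeholders):

    $ cat check_limits.sh
    #!/bin/bash
    # Report the limits this job inherits on the execution host.
    echo "open files (ulimit -n): $(ulimit -n)"
    echo "user procs (ulimit -u): $(ulimit -u)"
    $ qsub -cwd -j y -o limits.out check_limits.sh

If the reported value is too low, raising nofile in /etc/security/limits.conf on the compute nodes is the usual route; note that limits.conf is applied by PAM at login, so daemons started at boot (such as sge_execd, from which the job processes inherit their limits) may need the higher limit set in their init script or a restart from a shell that already has it.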