On 26.07.2011, at 21:51, Ralph Castain wrote:

> On Jul 26, 2011, at 1:39 PM, Reuti wrote:
>
>> Hi,
>>
>> On 26.07.2011, at 21:19, Lane, William wrote:
>>
>>> I can successfully run the MPI test code via OpenMPI 1.3.3 on fewer than 87 slots with both the btl_tcp_if_exclude and btl_tcp_if_include switches passed to mpirun.
>>>
>>> SGE always allocates the qsub jobs from the 24-slot nodes first -- up to the 96 slots that these 4 nodes have available (on the largeMem.q). The rest of the 602 slots are allocated from 2-slot nodes (all.q). All requests of up to 96 slots are serviced by the largeMem.q nodes (which have 24 slots apiece). Anything over 96 slots is serviced first by the largeMem.q nodes and then by the all.q nodes.
>>
>> Did you set up a JSV for this, as PEs have no sequence numbers in the case where a PE is requested?
>>
>>> Here's the PE that I'm using:
>>>
>>> mpich PE (Parallel Environment) queue:
>>>
>>> pe_name            mpich
>>> slots              9999
>>> user_lists         NONE
>>> xuser_lists        NONE
>>> start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
>>> stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
>>
>> Both of the above can be set to NONE if you compiled Open MPI with SGE integration (--with-sge).
>>
>> NB: what is defined in rsh_daemon/rsh_command in `qconf -sconf`?
>>
>>> allocation_rule    $fill_up
>>
>> Here you specify filling one machine completely before gathering slots from the next machine. You can change this to $round_robin to get one slot from each node before taking a second slot from any particular machine. If you prefer a fixed allocation, you can also put an integer here.
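
For reference, a tight-integration version of that PE could look like the sketch below. Nothing here is new apart from the layout: start_proc_args/stop_proc_args are set to NONE as suggested, allocation_rule shows the $round_robin alternative (keep $fill_up or an integer if you prefer), and the remaining values are the ones from the PE quoted in this thread. This assumes Open MPI was built with --with-sge.

    $ qconf -mp mpich
    pe_name            mpich
    slots              9999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    NONE
    stop_proc_args     NONE
    allocation_rule    $round_robin
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary TRUE

With control_slaves TRUE, Open MPI's tight integration can start its daemons on the slave nodes via qrsh -inherit, so they run under SGE's control and show up in the accounting.
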
> Remember, OMPI only uses SGE to launch one daemon per node. The placement of MPI procs is totally up to mpirun itself, which doesn't look at any SGE envar.

I thought that was the purpose of using --with-sge during configure: you don't have to provide any hostlist at all, and Open MPI will honor the allocation by reading the SGE envars to get the granted slots?

-- Reuti

>>> control_slaves        TRUE
>>> job_is_first_task     FALSE
>>> urgency_slots         min
>>> accounting_summary    TRUE
>>>
>>> Wouldn't the -bynode allocation be really inefficient? Does the -bynode switch imply only one slot is used on each node before it moves on to the next?
>>
>> Do I get it right: within the slots granted by SGE, you want the allocation inside Open MPI to follow a specific pattern, i.e. which rank goes where?
>>
>> -- Reuti
>>
>>> Thanks for your help Ralph. At least I have some ideas on where to look now.
>>>
>>> -Bill
>>> ________________________________________
>>> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>>> Sent: Tuesday, July 26, 2011 6:32 AM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] Can run OpenMPI testcode on 86 or fewer slots in cluster, but nothing more than that
>>>
>>> A few thoughts:
>>>
>>> * Including both btl_tcp_if_include and btl_tcp_if_exclude is problematic, as they are mutually exclusive options. I'm not sure which one will take precedence. I would suggest using only one of them.
>>>
>>> * The default mapping algorithm is byslot - i.e., OMPI will place procs on a node until all slots on that node have been filled, and then move on to the next node. Depending on what you have in your machinefile, it is possible that all 88 procs are being placed on the first node. You might try spreading your procs across all nodes with -bynode on the cmd line, or check to ensure that the machinefile correctly specifies the number of slots on each node. Note: OMPI will automatically read the SGE environment to get the host allocation, so the only reason for providing a machinefile is if you don't want the full allocation used.
>>>
>>> * 88*88 = 7744. MPI transport connections are point-to-point - i.e., each proc opens a unique connection to each other proc. If your procs are all winding up on the same node, for example, then the system will want at least 7744 file descriptors on that node, assuming your application does a complete wireup across all procs.
>>>
>>> Updating to 1.4.3 would be a good idea as it is more stable, but it may not resolve this problem if the issue is one of the above.
>>>
>>> HTH
>>> Ralph
>>>
>>>
>>> On Jul 25, 2011, at 11:23 PM, Lane, William wrote:
>>>
>>>> Please help me resolve the following problems with a 306-node Rocks cluster using SGE. Please note I can run the job successfully on <87 slots, but not on any more than that.
>>>>
>>>> We're running SGE and I'm submitting my jobs via the SGE CLI utility qsub and the following lines from a script:
>>>>
>>>> mpirun -n $NSLOTS -machinefile $TMPDIR/machines --mca btl_tcp_if_include eth0 --mca btl_tcp_if_exclude eth1 --mca oob_tcp_if_exclude eth1 --mca opal_set_max_sys_limits 1 --mca pls_gridengine_verbose 1 /stf/billstst/ProcessColors2MPICH1
>>>> echo "MPICH1 mpirun returned #?"
>>>>
>>>> eth1 is the connection to the Isilon NAS, where the object file is located.
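
As an aside on the mpirun line just quoted, and picking up Ralph's point above that btl_tcp_if_include and btl_tcp_if_exclude are mutually exclusive: a reduced invocation could look like the following sketch. Everything in it is taken from the original command; the only changes are dropping the conflicting btl_tcp_if_exclude and leaving out the machinefile so the full SGE allocation is used.

    mpirun -n $NSLOTS \
        --mca btl_tcp_if_include eth0 \
        --mca oob_tcp_if_exclude eth1 \
        --mca opal_set_max_sys_limits 1 \
        --mca pls_gridengine_verbose 1 \
        /stf/billstst/ProcessColors2MPICH1
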
>>>>
>>>> The error messages returned are of the form:
>>>>
>>>> WRT ORTE_ERROR_LOG: The system limit on number of pipes a process can open was reached
>>>> WRT ORTE_ERROR_LOG: The system limit on number of network connections a process can open was reached in file oob_tcp.c at line 447
>>>>
>>>> We have increased the open file limit from 1024 to 4096; the problem still exists.
>>>>
>>>> I can run the same test code via MPICH2 successfully on all 696 slots of the cluster, but I can't run the same code (compiled via OpenMPI version 1.3.3) on any more than 86 slots.
>>>>
>>>> Here are the details on the installed version of Open MPI:
>>>>
>>>> [root]# ./ompi_info
>>>> Package: Open MPI r...@build-x86-64.rocksclusters.org Distribution
>>>> Open MPI: 1.3.3
>>>> Open MPI SVN revision: r21666
>>>> Open MPI release date: Jul 14, 2009
>>>> Open RTE: 1.3.3
>>>> Open RTE SVN revision: r21666
>>>> Open RTE release date: Jul 14, 2009
>>>> OPAL: 1.3.3
>>>> OPAL SVN revision: r21666
>>>> OPAL release date: Jul 14, 2009
>>>> Ident string: 1.3.3
>>>> Prefix: /opt/openmpi
>>>> Configured architecture: x86_64-unknown-linux-gnu
>>>> Configure host: build-x86-64.rocksclusters.org
>>>> Configured by: root
>>>> Configured on: Sat Dec 12 16:29:23 PST 2009
>>>> Configure host: build-x86-64.rocksclusters.org
>>>> Built by: bruno
>>>> Built on: Sat Dec 12 16:42:52 PST 2009
>>>> Built host: build-x86-64.rocksclusters.org
>>>> C bindings: yes
>>>> C++ bindings: yes
>>>> Fortran77 bindings: yes (all)
>>>> Fortran90 bindings: yes
>>>> Fortran90 bindings size: small
>>>> C compiler: gcc
>>>> C compiler absolute: /usr/bin/gcc
>>>> C++ compiler: g++
>>>> C++ compiler absolute: /usr/bin/g++
>>>> Fortran77 compiler: gfortran
>>>> Fortran77 compiler abs: /usr/bin/gfortran
>>>> Fortran90 compiler: gfortran
>>>> Fortran90 compiler abs: /usr/bin/gfortran
>>>> C profiling: yes
>>>> C++ profiling: yes
>>>> Fortran77 profiling: yes
>>>> Fortran90 profiling: yes
>>>> C++ exceptions: no
>>>> Thread support: posix (mpi: no, progress: no)
>>>> Sparse Groups: no
>>>> Internal debug support: no
>>>> MPI parameter check: runtime
>>>> Memory profiling support: no
>>>> Memory debugging support: no
>>>> libltdl support: yes
>>>> Heterogeneous support: no
>>>> mpirun default --prefix: no
>>>> MPI I/O support: yes
>>>> MPI_WTIME support: gettimeofday
>>>> Symbol visibility support: yes
>>>> FT Checkpoint support: no (checkpoint thread: no)
>>>> MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA carto: file (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA timer: linux (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA installdirs: config (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA coll: basic (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA coll: inter (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA coll: self (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA coll: sm (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA coll: sync (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA coll: tuned (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA io: romio (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA mpool: fake (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA mpool: sm (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA pml: cm (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA pml: csum (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA pml: v (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA bml: r2 (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA rcache: vma (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA btl: self (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA btl: sm (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA btl: tcp (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA topo: unity (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA osc: rdma (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA iof: hnp (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA iof: orted (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA iof: tool (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA oob: tcp (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA odls: default (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA ras: slurm (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA rml: oob (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA routed: binomial (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA routed: direct (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA routed: linear (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA plm: rsh (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA plm: slurm (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA filem: rsh (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA errmgr: default (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA ess: env (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA ess: hnp (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA ess: singleton (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA ess: slurm (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA ess: tool (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.3.3)
>>>> MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.3.3)
>>>>
>>>> Would upgrading to the latest version of OpenMPI (1.4.3) resolve this issue?
>>>>
>>>> Thank you,
>>>>
>>>> -Bill Lane
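
Regarding the 88*88 = 7744 figure above: the open-file limit of 4096 mentioned in the report is still below that, so the ORTE_ERROR_LOG messages about pipes and network connections are consistent with running out of descriptors. A rough way to check and raise the limit on the compute nodes (the value 16384 is only an example chosen to sit comfortably above 7744):

    # current soft and hard open-file limits
    ulimit -Sn
    ulimit -Hn

    # raise them on every node, e.g. by adding to /etc/security/limits.conf:
    #   *   soft   nofile   16384
    #   *   hard   nofile   16384

Since limits.conf is applied by pam_limits at login, it is worth verifying with ulimit -n from inside a short test job that the new limit actually reaches the processes spawned by the SGE execd.
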
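
On the earlier question about rsh_daemon/rsh_command: two quick checks for the tight integration, shown here only as a sketch:

    # confirm this Open MPI build has the SGE components (cf. "MCA ras: gridengine" above)
    ompi_info | grep -i gridengine

    # see what the cluster configuration uses for remote startup
    qconf -sconf | grep -E 'rsh_command|rsh_daemon'
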