I’m not sure our man pages are good enough to answer your question, but here is the URL
http://www.open-mpi.org/doc/v1.8/

I'm a tad tied up right now, but I'll try to address this prior to the 1.8.5 release. Thanks for all that debug effort! Helps a bunch.

> On Apr 7, 2015, at 1:17 PM, Lane, William <william.l...@cshs.org> wrote:
>
> Ralph,
>
> I've finally had some luck using the following:
>
> $MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-single --mca btl_tcp_if_include eth0 --hetero-nodes --use-hwthread-cpus --prefix $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN
>
> where $NSLOTS was 56 and my hostfile hostfile-single is:
>
> csclprd3-0-0 slots=12 max-slots=24
> csclprd3-0-1 slots=6 max-slots=12
> csclprd3-0-2 slots=6 max-slots=12
> csclprd3-0-3 slots=6 max-slots=12
> csclprd3-0-4 slots=6 max-slots=12
> csclprd3-0-5 slots=6 max-slots=12
> csclprd3-0-6 slots=6 max-slots=12
> csclprd3-6-1 slots=4 max-slots=4
> csclprd3-6-5 slots=4 max-slots=4
>
> The max-slots value differs from slots on some nodes because I include the hyperthreaded cores in max-slots; the last two nodes have CPUs that don't support hyperthreading at all.
>
> Does --use-hwthread-cpus prevent slots from being assigned to hyperthreading cores?
>
> For some reason the man pages for OpenMPI 1.8.2 aren't installed on our CentOS 6.3 systems. Is there a URL where I can find a copy of the man pages for OpenMPI 1.8.2?
>
> Thanks for your help,
>
> -Bill Lane
>
> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
> Sent: Monday, April 06, 2015 1:39 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
>
> Hmmm…well, that shouldn't be the issue. To check, try running it with "bind-to none". If you can get a backtrace telling us where it is crashing, that would also help.
>
>> On Apr 6, 2015, at 12:24 PM, Lane, William <william.l...@cshs.org> wrote:
>>
>> Ralph,
>>
>> For the following two different command-line invocations of the LAPACK benchmark,
>>
>> $MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-no_slots --mca btl_tcp_if_include eth0 --hetero-nodes --use-hwthread-cpus --bind-to hwthread --prefix $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN
>>
>> $MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-no_slots --mca btl_tcp_if_include eth0 --hetero-nodes --bind-to-core --prefix $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN
>>
>> I'm receiving the same kinds of OpenMPI error messages (but for different nodes in the ring):
>>
>> [csclprd3-0-16:25940] *** Process received signal ***
>> [csclprd3-0-16:25940] Signal: Bus error (7)
>> [csclprd3-0-16:25940] Signal code: Non-existant physical address (2)
>> [csclprd3-0-16:25940] Failing at address: 0x7f8b1b5a2600
>>
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 82 with PID 25936 on node csclprd3-0-16 exited on signal 7 (Bus error).
>> --------------------------------------------------------------------------
>> 16 total processes killed (some possibly by mpirun during cleanup)
>>
>> It seems to occur on systems that have more than one physical CPU installed. Could this be due to a lack of the correct NUMA libraries being installed?
>>
>> -Bill L.
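[For the backtrace requested above, a minimal sketch of one way to capture it (assuming bash, a gdb install on the compute nodes, and that core dumps are permitted; the core file name below is only illustrative and depends on the kernel's core_pattern setting):

    # allow core files to be written; for remote nodes this may need to go in the
    # shell startup files so the launched processes inherit the limit
    ulimit -c unlimited

    # re-run the benchmark, then on the node that reported the bus error load the
    # crashed binary plus its core file into gdb and print the stack
    gdb $BENCH_DIR/$APP_DIR/$APP_BIN core.25940
    (gdb) bt

The "bt" output would show where in the benchmark or the MPI library the bus error occurs.]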
>>
>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>> Sent: Sunday, April 05, 2015 6:09 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
>>
>>> On Apr 5, 2015, at 5:58 PM, Lane, William <william.l...@cshs.org> wrote:
>>>
>>> I think some of the Intel Blade systems in the cluster are dual core, but don't support hyperthreading. Maybe it would be better to exclude hyperthreading altogether from submitted OpenMPI jobs?
>>
>> Yes - or you can add "--hetero-nodes --use-hwthread-cpus --bind-to hwthread" to the cmd line. The first option tells mpirun that the nodes aren't all the same, so it has to look at each node's topology instead of taking the first node as the template for everything. The second tells it to use the HTs as independent cpus where they are supported.
>>
>> I'm not entirely sure the suggestion will work - if we hit a place where HT isn't supported, we may balk at being asked to bind to HTs. I can probably make a change that supports this kind of hetero arrangement (perhaps something like bind-to pu) - it might make it into 1.8.5 (we are just starting the release process on it now).
>>
>>>
>>> OpenMPI doesn't crash, but it doesn't run the LAPACK benchmark either.
>>>
>>> Thanks again Ralph.
>>>
>>> Bill L.
>>>
>>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>>> Sent: Wednesday, April 01, 2015 8:40 AM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
>>>
>>> Bingo - you said the magic word. This is a terminology issue. When we say "core", we mean the old definition of "core", not "hyperthreads". If you want to use HTs as your base processing unit and bind to them, then you need to specify --bind-to hwthread. That warning should then go away.
>>>
>>> We don't require that a swap region be mounted - I didn't see anything in your original message indicating that OMPI had actually crashed, just that it wasn't launching due to the above issue. Were you actually seeing crashes as well?
>>>
>>> On Wed, Apr 1, 2015 at 8:31 AM, Lane, William <william.l...@cshs.org> wrote:
>>> Ralph,
>>>
>>> Here's the associated hostfile:
>>>
>>> #openMPI hostfile for csclprd3
>>> #max slots prevents oversubscribing csclprd3-0-9
>>> csclprd3-0-0 slots=12 max-slots=12
>>> csclprd3-0-1 slots=6 max-slots=6
>>> csclprd3-0-2 slots=6 max-slots=6
>>> csclprd3-0-3 slots=6 max-slots=6
>>> csclprd3-0-4 slots=6 max-slots=6
>>> csclprd3-0-5 slots=6 max-slots=6
>>> csclprd3-0-6 slots=6 max-slots=6
>>> csclprd3-0-7 slots=32 max-slots=32
>>> csclprd3-0-8 slots=32 max-slots=32
>>> csclprd3-0-9 slots=32 max-slots=32
>>> csclprd3-0-10 slots=32 max-slots=32
>>> csclprd3-0-11 slots=32 max-slots=32
>>> csclprd3-0-12 slots=12 max-slots=12
>>> csclprd3-0-13 slots=24 max-slots=24
>>> csclprd3-0-14 slots=16 max-slots=16
>>> csclprd3-0-15 slots=16 max-slots=16
>>> csclprd3-0-16 slots=24 max-slots=24
>>> csclprd3-0-17 slots=24 max-slots=24
>>> csclprd3-6-1 slots=4 max-slots=4
>>> csclprd3-6-5 slots=4 max-slots=4
>>>
>>> The number of slots also includes hyperthreading cores.
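[If the intent is for slots to reflect only physical cores, a quick way to check the counts on a node is sketched below; lscpu ships with util-linux on CentOS 6, and the grep pattern is only illustrative:

    # show sockets, cores per socket, and threads per core on this node
    lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core'

Physical cores = Socket(s) x Core(s) per socket; hardware threads = physical cores x Thread(s) per core. Setting slots to the physical-core count and max-slots to the hardware-thread count, as in the hostfile-single file quoted earlier in the thread, follows that arithmetic.]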
>>>
>>> One more question: would not having defined swap partitions on all the nodes in the ring cause OpenMPI to crash? I ask because no swap partitions are defined for any of the above systems.
>>>
>>> -Bill L.
>>>
>>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>>> Sent: Wednesday, April 01, 2015 5:04 AM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
>>>
>>> The warning about binding to memory is due to not having numactl-devel installed on the system. The job would still run, but we are warning you that we cannot bind memory to the same domain as the core where we bind the process. That can cause poor performance, but it is not fatal. I forget the name of the param, but you can tell us to "shut up" :-)
>>>
>>> The other warning/error indicates that we aren't seeing enough cores in the allocation you gave us via the hostfile to support one proc per core - i.e., we didn't see at least 128 cores in the sum of the nodes you told us about. I take it you were expecting that there were that many or more?
>>>
>>> Ralph
>>>
>>> On Wed, Apr 1, 2015 at 12:54 AM, Lane, William <william.l...@cshs.org> wrote:
>>> I'm having problems running OpenMPI jobs (using a hostfile) on an HPC cluster running ROCKS on CentOS 6.3. I'm running OpenMPI outside of Sun Grid Engine (i.e. it is not submitted as a job to SGE). The program being run is a LAPACK benchmark. The command line I'm using to run the jobs is:
>>>
>>> $MPI_DIR/bin/mpirun -np $NSLOTS -bind-to-core -report-bindings --hostfile hostfile --mca btl_tcp_if_include eth0 --prefix $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN
>>>
>>> where MPI_DIR=/hpc/apps/mpi/openmpi/1.8.2/ and NSLOTS=128.
>>>
>>> I'm getting errors of the following form, and OpenMPI never runs the LAPACK benchmark:
>>>
>>> --------------------------------------------------------------------------
>>> WARNING: a request was made to bind a process. While the system
>>> supports binding the process itself, at least one node does NOT
>>> support binding memory to the process location.
>>>
>>>   Node: csclprd3-0-11
>>>
>>> This usually is due to not having the required NUMA support installed
>>> on the node. In some Linux distributions, the required support is
>>> contained in the libnumactl and libnumactl-devel packages.
>>> This is a warning only; your job will continue, though performance may
>>> be degraded.
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> A request was made to bind to that would result in binding more
>>> processes than cpus on a resource:
>>>
>>>   Bind to:     CORE
>>>   Node:        csclprd3-0-11
>>>   #processes:  2
>>>   #cpus:       1
>>>
>>> You can override this protection by adding the "overload-allowed"
>>> option to your binding directive.
>>> --------------------------------------------------------------------------
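[If oversubscribing a node were actually intended, the "overload-allowed" modifier the message refers to is appended to the binding directive on the mpirun command line. A sketch, assuming the 1.8-series option syntax (worth verifying against mpirun --help on the installed build):

    # allow more bound processes than cpus on a node; same variables as the command above
    $MPI_DIR/bin/mpirun -np $NSLOTS --bind-to core:overload-allowed --report-bindings \
        --hostfile hostfile --mca btl_tcp_if_include eth0 --prefix $MPI_DIR \
        $BENCH_DIR/$APP_DIR/$APP_BIN

In this thread, though, the underlying problem is that mpirun counted fewer cores than $NSLOTS, so correcting the slot counts or binding to hwthreads is the more direct fix.]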
>>>
>>> The only installed numa packages are:
>>> numactl.x86_64  2.0.7-3.el6  @centos6.3-x86_64-0/$
>>>
>>> When I search for the available NUMA packages I find:
>>>
>>> yum search numa | less
>>>
>>> Loaded plugins: fastestmirror
>>> Loading mirror speeds from cached hostfile
>>> ============================== N/S Matched: numa ===============================
>>> numactl-devel.i686 : Development package for building Applications that use numa
>>> numactl-devel.x86_64 : Development package for building Applications that use numa
>>> numad.x86_64 : NUMA user daemon
>>> numactl.i686 : Library for tuning for Non Uniform Memory Access machines
>>> numactl.x86_64 : Library for tuning for Non Uniform Memory Access machines
>>>
>>> Do I need to install additional and/or different NUMA packages in order to get OpenMPI to work on this cluster?
>>>
>>> -Bill Lane
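[Per Ralph's later reply, the memory-binding warning is tied to the missing numactl-devel package, which does appear in the yum search output above. A minimal sketch for CentOS 6, assuming root access on the affected nodes (package names taken from that search output):

    # install the NUMA development package that OMPI's memory-binding support wants
    yum install numactl-devel.x86_64

    # verify it landed
    rpm -q numactl-devel

This should silence the memory-binding warning on those nodes; whether it has any bearing on the later bus errors is a separate question.]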