Hmmm… well, that shouldn't be the issue. To check, try running it with "--bind-to none". If you can get a backtrace telling us where it is crashing, that would also help.
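For example, a sketch using the same paths and environment variables as your earlier command lines (adjust as needed):

    $MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-no_slots \
        --mca btl_tcp_if_include eth0 --hetero-nodes --bind-to none \
        --prefix $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN

For the backtrace, one common approach (assuming bash and gdb are available on the nodes) is to allow core dumps before launching and then open the core file on the node that reported the bus error:

    ulimit -c unlimited    # permit core files (the core file name depends on the node's core_pattern setting)
    # re-run the job, then on the failing node:
    gdb $BENCH_DIR/$APP_DIR/$APP_BIN <corefile>    # load the benchmark binary together with its core file
    (gdb) bt                                       # print the backtrace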
> On Apr 6, 2015, at 12:24 PM, Lane, William <william.l...@cshs.org> wrote:
>
> Ralph,
>
> For the following two different command-line invocations of the LAPACK benchmark
>
> $MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-no_slots --mca btl_tcp_if_include eth0 --hetero-nodes --use-hwthread-cpus --bind-to hwthread --prefix $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN
>
> $MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-no_slots --mca btl_tcp_if_include eth0 --hetero-nodes --bind-to-core --prefix $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN
>
> I'm receiving the same kinds of OpenMPI error messages (but for different nodes in the ring):
>
> [csclprd3-0-16:25940] *** Process received signal ***
> [csclprd3-0-16:25940] Signal: Bus error (7)
> [csclprd3-0-16:25940] Signal code: Non-existant physical address (2)
> [csclprd3-0-16:25940] Failing at address: 0x7f8b1b5a2600
>
> --------------------------------------------------------------------------
> mpirun noticed that process rank 82 with PID 25936 on node
> csclprd3-0-16 exited on signal 7 (Bus error).
> --------------------------------------------------------------------------
> 16 total processes killed (some possibly by mpirun during cleanup)
>
> It seems to occur on systems that have more than one physical CPU installed. Could this be due to a lack of the correct NUMA libraries being installed?
>
> -Bill L.
>
> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
> Sent: Sunday, April 05, 2015 6:09 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
>
>> On Apr 5, 2015, at 5:58 PM, Lane, William <william.l...@cshs.org> wrote:
>>
>> I think some of the Intel Blade systems in the cluster are dual core, but don't support hyperthreading. Maybe it would be better to exclude hyperthreading altogether from submitted OpenMPI jobs?
>
> Yes - or you can add "--hetero-nodes --use-hwthread-cpus --bind-to hwthread" to the cmd line. The first tells mpirun that the nodes aren't all the same, and so it has to look at each node's topology instead of taking the first node as the template for everything. The second tells it to use the HTs as independent cpus where they are supported.
>
> I'm not entirely sure the suggestion will work - if we hit a place where HT isn't supported, we may balk at being asked to bind to HTs. I can probably make a change that supports this kind of hetero arrangement (perhaps something like --bind-to pu) - it might make it into 1.8.5 (we are just starting the release process on it now).
>
>> OpenMPI doesn't crash, but it doesn't run the LAPACK benchmark either.
>>
>> Thanks again Ralph.
>>
>> Bill L.
>>
>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>> Sent: Wednesday, April 01, 2015 8:40 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
>>
>> Bingo - you said the magic word. This is a terminology issue. When we say "core", we mean the old definition of "core", not "hyperthreads". If you want to use HTs as your base processing unit and bind to them, then you need to specify --bind-to hwthread. That warning should then go away.
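As an aside: if you want to check which nodes actually expose hyperthreads before binding to them, something along these lines on each node should tell you (lscpu ships with util-linux and lstopo comes with hwloc, so availability may vary by node):

    lscpu | egrep 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'
    lstopo --of console    # full topology (sockets, cores, hardware threads) on this node

Nodes that report "Thread(s) per core: 1" have no hardware threads for --use-hwthread-cpus to bind to.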
>>
>> We don't require that a swap region be mounted - I didn't see anything in your original message indicating that OMPI had actually crashed, just that it wasn't launching due to the above issue. Were you actually seeing crashes as well?
>>
>> On Wed, Apr 1, 2015 at 8:31 AM, Lane, William <william.l...@cshs.org> wrote:
>> Ralph,
>>
>> Here's the associated hostfile:
>>
>> #openMPI hostfile for csclprd3
>> #max slots prevents oversubscribing csclprd3-0-9
>> csclprd3-0-0 slots=12 max-slots=12
>> csclprd3-0-1 slots=6 max-slots=6
>> csclprd3-0-2 slots=6 max-slots=6
>> csclprd3-0-3 slots=6 max-slots=6
>> csclprd3-0-4 slots=6 max-slots=6
>> csclprd3-0-5 slots=6 max-slots=6
>> csclprd3-0-6 slots=6 max-slots=6
>> csclprd3-0-7 slots=32 max-slots=32
>> csclprd3-0-8 slots=32 max-slots=32
>> csclprd3-0-9 slots=32 max-slots=32
>> csclprd3-0-10 slots=32 max-slots=32
>> csclprd3-0-11 slots=32 max-slots=32
>> csclprd3-0-12 slots=12 max-slots=12
>> csclprd3-0-13 slots=24 max-slots=24
>> csclprd3-0-14 slots=16 max-slots=16
>> csclprd3-0-15 slots=16 max-slots=16
>> csclprd3-0-16 slots=24 max-slots=24
>> csclprd3-0-17 slots=24 max-slots=24
>> csclprd3-6-1 slots=4 max-slots=4
>> csclprd3-6-5 slots=4 max-slots=4
>>
>> The number of slots also includes hyperthreading cores.
>>
>> One more question: would not having swap partitions defined on all the nodes in the ring cause OpenMPI to crash? No swap partitions are defined for any of the above systems.
>>
>> -Bill L.
>>
>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>> Sent: Wednesday, April 01, 2015 5:04 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
>>
>> The warning about binding to memory is due to not having numactl-devel installed on the system. The job would still run, but we are warning you that we cannot bind memory to the same domain as the core where we bind the process. That can cause poor performance, but it isn't fatal. I forget the name of the param, but you can tell us to "shut up" :-)
>>
>> The other warning/error indicates that we aren't seeing enough cores on the allocation you gave us via the hostfile to support one proc/core - i.e., we didn't see at least 128 cores in the sum of the nodes you told us about. I take it you were expecting that there were that many or more?
>>
>> Ralph
>>
>> On Wed, Apr 1, 2015 at 12:54 AM, Lane, William <william.l...@cshs.org> wrote:
>> I'm having problems running OpenMPI jobs (using a hostfile) on an HPC cluster running ROCKS on CentOS 6.3. I'm running OpenMPI outside of Sun Grid Engine (i.e. it is not submitted as a job to SGE). The program being run is a LAPACK benchmark. The command line I'm using to run the jobs is:
>>
>> $MPI_DIR/bin/mpirun -np $NSLOTS -bind-to-core -report-bindings --hostfile hostfile --mca btl_tcp_if_include eth0 --prefix $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN
>>
>> where MPI_DIR=/hpc/apps/mpi/openmpi/1.8.2/ and NSLOTS=128.
>>
>> I'm getting errors of the following form, and OpenMPI never runs the LAPACK benchmark:
>>
>> --------------------------------------------------------------------------
>> WARNING: a request was made to bind a process. While the system
>> supports binding the process itself, at least one node does NOT
>> support binding memory to the process location.
>>
>>   Node: csclprd3-0-11
>>
>> This usually is due to not having the required NUMA support installed
>> on the node. In some Linux distributions, the required support is
>> contained in the libnumactl and libnumactl-devel packages.
>> This is a warning only; your job will continue, though performance may be degraded.
>> --------------------------------------------------------------------------
>>
>> --------------------------------------------------------------------------
>> A request was made to bind to that would result in binding more
>> processes than cpus on a resource:
>>
>>   Bind to:     CORE
>>   Node:        csclprd3-0-11
>>   #processes:  2
>>   #cpus:       1
>>
>> You can override this protection by adding the "overload-allowed"
>> option to your binding directive.
>> --------------------------------------------------------------------------
>>
>> The only installed numa packages are:
>>
>> numactl.x86_64   2.0.7-3.el6   @centos6.3-x86_64-0/$
>>
>> When I search for the available NUMA packages I find:
>>
>> yum search numa | less
>>
>> Loaded plugins: fastestmirror
>> Loading mirror speeds from cached hostfile
>> ============================== N/S Matched: numa ===============================
>> numactl-devel.i686 : Development package for building Applications that use numa
>> numactl-devel.x86_64 : Development package for building Applications that use numa
>> numad.x86_64 : NUMA user daemon
>> numactl.i686 : Library for tuning for Non Uniform Memory Access machines
>> numactl.x86_64 : Library for tuning for Non Uniform Memory Access machines
>>
>> Do I need to install additional and/or different NUMA packages in order to get OpenMPI to work on this cluster?
>>
>> -Bill Lane
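Installing the devel package on each node should be enough to clear the memory-binding warning. On CentOS 6 that would typically be something along the lines of:

    yum install numactl numactl-devel    # run as root on every compute node

If Open MPI was built on a machine that didn't have those headers at configure time, it may also need to be rebuilt afterwards so that the memory-binding support actually gets compiled in. Note this only addresses the memory-binding warning - it isn't expected to explain the bus errors, which is why the backtrace above would help.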