> On Apr 8, 2015, at 10:20 AM, Lane, William <william.l...@cshs.org> wrote:
>
> Ralph,
>
> I just wanted to add that roughly a year ago I was fighting with these
> same issues, but was re-tasked to more pressing work and had to abandon
> looking into these OpenMPI 1.8.2 issues on our CentOS 6.3 cluster.
>
> In any case, in digging around I found you had the following
> recommendation back then:
>
> > Argh - yeah, I got confused as things context switched a few too many
> > times. The 1.8.2 release should certainly understand that arrangement, and
> > --hetero-nodes. The only way it wouldn't see the latter is if you configure
> > it --without-hwloc, or hwloc refused to build.
> > I believe we fixed those issues.
> >
> > Since there was a question about the numactl-devel requirement, I suspect
> > that is the root cause of all evil in this case, and the lack of
> > --hetero-nodes would confirm that diagnosis :-)
>
> So the numactl-devel library is required for OpenMPI to function on NUMA
> nodes? Or maybe just NUMA nodes that also have hyperthreading capabilities?

Binding in general requires numactl-devel, whether to HT or non-HT nodes.

> Bill L.
>
> From: users [users-boun...@open-mpi.org] on behalf of Lane, William [william.l...@cshs.org]
> Sent: Wednesday, April 08, 2015 9:29 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
>
> Ralph,
>
> Thanks for YOUR help, I never would've managed to get the LAPACK
> benchmark running on more than one node in our cluster without it.
>
> Ralph, is hyperthreading more of a curse than an advantage for HPC
> applications?
>
> I'm going to go through all the OpenMPI articles on hyperthreading and
> NUMA to see if that will shed any light on these issues.
>
> -Bill L.
>
> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
> Sent: Tuesday, April 07, 2015 7:32 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
>
> I'm not sure our man pages are good enough to answer your question, but
> here is the URL:
>
> http://www.open-mpi.org/doc/v1.8/
>
> I'm a tad tied up right now, but I'll try to address this prior to the
> 1.8.5 release. Thanks for all that debug effort! Helps a bunch.
>
>> On Apr 7, 2015, at 1:17 PM, Lane, William <william.l...@cshs.org> wrote:
>>
>> Ralph,
>>
>> I've finally had some luck using the following:
>>
>> $MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-single --mca btl_tcp_if_include eth0 --hetero-nodes --use-hwthread-cpus --prefix $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN
>>
>> Where $NSLOTS was 56 and my hostfile hostfile-single is:
>>
>> csclprd3-0-0 slots=12 max-slots=24
>> csclprd3-0-1 slots=6 max-slots=12
>> csclprd3-0-2 slots=6 max-slots=12
>> csclprd3-0-3 slots=6 max-slots=12
>> csclprd3-0-4 slots=6 max-slots=12
>> csclprd3-0-5 slots=6 max-slots=12
>> csclprd3-0-6 slots=6 max-slots=12
>> csclprd3-6-1 slots=4 max-slots=4
>> csclprd3-6-5 slots=4 max-slots=4
>>
>> The max-slots value differs from slots on some nodes because I include
>> the hyperthreaded cores in max-slots; the last two nodes have CPUs that
>> don't support hyperthreading at all.
>>
>> Does --use-hwthread-cpus prevent slots from being assigned to
>> hyperthreading cores?
>>
>> For some reason the manpage for OpenMPI 1.8.2 isn't installed on our
>> CentOS 6.3 systems. Is there a URL where I can find a copy of the
>> manpages for OpenMPI 1.8.2?
>>
>> Thanks for your help,
>>
>> -Bill Lane
>>
>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>> Sent: Monday, April 06, 2015 1:39 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
>>
>> Hmmm... well, that shouldn't be the issue. To check, try running it with
>> "--bind-to none". If you can get a backtrace telling us where it is
>> crashing, that would also help.
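One way to capture such a backtrace, sketched under the assumption that core dumps are permitted on the compute nodes (core-file naming varies by system, and $MPI_DIR, $NSLOTS, and the benchmark path are the same placeholders used in the commands quoted in this thread):

    # allow core dumps in the launching shell; this must also take effect
    # on the remote nodes, e.g. via the shell startup files
    ulimit -c unlimited

    # rerun with binding disabled, as suggested above
    $MPI_DIR/bin/mpirun -np $NSLOTS --bind-to none --hostfile hostfile \
        --mca btl_tcp_if_include eth0 --prefix $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN

    # on the node that reported the signal, open the core file and print the stack
    gdb $BENCH_DIR/$APP_DIR/$APP_BIN core
    (gdb) bt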
>>> On Apr 6, 2015, at 12:24 PM, Lane, William <william.l...@cshs.org> wrote:
>>>
>>> Ralph,
>>>
>>> For the following two different commandline invocations of the LAPACK
>>> benchmark:
>>>
>>> $MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-no_slots --mca btl_tcp_if_include eth0 --hetero-nodes --use-hwthread-cpus --bind-to hwthread --prefix $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN
>>>
>>> $MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-no_slots --mca btl_tcp_if_include eth0 --hetero-nodes --bind-to-core --prefix $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN
>>>
>>> I'm receiving the same kinds of OpenMPI error messages (but for
>>> different nodes in the ring):
>>>
>>> [csclprd3-0-16:25940] *** Process received signal ***
>>> [csclprd3-0-16:25940] Signal: Bus error (7)
>>> [csclprd3-0-16:25940] Signal code: Non-existant physical address (2)
>>> [csclprd3-0-16:25940] Failing at address: 0x7f8b1b5a2600
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 82 with PID 25936 on node
>>> csclprd3-0-16 exited on signal 7 (Bus error).
>>> --------------------------------------------------------------------------
>>> 16 total processes killed (some possibly by mpirun during cleanup)
>>>
>>> It seems to occur on systems that have more than one physical CPU
>>> installed. Could this be due to a lack of the correct NUMA libraries
>>> being installed?
>>>
>>> -Bill L.
>>>
>>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>>> Sent: Sunday, April 05, 2015 6:09 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
>>>
>>>> On Apr 5, 2015, at 5:58 PM, Lane, William <william.l...@cshs.org> wrote:
>>>>
>>>> I think some of the Intel Blade systems in the cluster are
>>>> dual core, but don't support hyperthreading. Maybe it
>>>> would be better to exclude hyperthreading altogether
>>>> from submitted OpenMPI jobs?
>>>
>>> Yes - or you can add "--hetero-nodes --use-hwthread-cpus --bind-to hwthread"
>>> to the cmd line. The first tells mpirun that the nodes aren't all the same,
>>> and so it has to look at each node's topology instead of taking the first
>>> node as the template for everything. The second tells it to use the HTs as
>>> independent cpus where they are supported.
>>>
>>> I'm not entirely sure the suggestion will work - if we hit a place where HT
>>> isn't supported, we may balk at being asked to bind to HTs. I can probably
>>> make a change that supports this kind of hetero arrangement (perhaps
>>> something like bind-to pu) - might make it into 1.8.5 (we are just starting
>>> the release process on it now).
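A quick way to see which nodes actually expose hyperthreads, and so whether this mix of HT and non-HT hardware is what mpirun is tripping over (a sketch; it assumes lscpu is available, as it is on stock CentOS 6, and that passwordless ssh works between nodes - the two node names are just examples taken from this thread):

    # compare sockets, cores, and threads on a suspect node and a non-HT node
    for node in csclprd3-0-16 csclprd3-6-1; do
        echo "== $node =="
        ssh $node 'lscpu | egrep "^(CPU\(s\)|Thread|Core|Socket)"'
    done

An HT-capable node reports "Thread(s) per core: 2"; the older blades report 1, which is exactly the heterogeneity that --hetero-nodes tells mpirun to expect.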
>>>> OpenMPI doesn't crash, but it doesn't run the LAPACK
>>>> benchmark either.
>>>>
>>>> Thanks again Ralph.
>>>>
>>>> Bill L.
>>>>
>>>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>>>> Sent: Wednesday, April 01, 2015 8:40 AM
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
>>>>
>>>> Bingo - you said the magic word. This is a terminology issue. When we say
>>>> "core", we mean the old definition of "core", not "hyperthreads". If you
>>>> want to use HTs as your base processing unit and bind to them, then you
>>>> need to specify --bind-to hwthread. That warning should then go away.
>>>>
>>>> We don't require that a swap region be mounted - I didn't see anything in
>>>> your original message indicating that OMPI had actually crashed, just that
>>>> it wasn't launching due to the above issue. Were you actually seeing
>>>> crashes as well?
>>>>
>>>> On Wed, Apr 1, 2015 at 8:31 AM, Lane, William <william.l...@cshs.org> wrote:
>>>> Ralph,
>>>>
>>>> Here's the associated hostfile:
>>>>
>>>> #openMPI hostfile for csclprd3
>>>> #max slots prevents oversubscribing csclprd3-0-9
>>>> csclprd3-0-0 slots=12 max-slots=12
>>>> csclprd3-0-1 slots=6 max-slots=6
>>>> csclprd3-0-2 slots=6 max-slots=6
>>>> csclprd3-0-3 slots=6 max-slots=6
>>>> csclprd3-0-4 slots=6 max-slots=6
>>>> csclprd3-0-5 slots=6 max-slots=6
>>>> csclprd3-0-6 slots=6 max-slots=6
>>>> csclprd3-0-7 slots=32 max-slots=32
>>>> csclprd3-0-8 slots=32 max-slots=32
>>>> csclprd3-0-9 slots=32 max-slots=32
>>>> csclprd3-0-10 slots=32 max-slots=32
>>>> csclprd3-0-11 slots=32 max-slots=32
>>>> csclprd3-0-12 slots=12 max-slots=12
>>>> csclprd3-0-13 slots=24 max-slots=24
>>>> csclprd3-0-14 slots=16 max-slots=16
>>>> csclprd3-0-15 slots=16 max-slots=16
>>>> csclprd3-0-16 slots=24 max-slots=24
>>>> csclprd3-0-17 slots=24 max-slots=24
>>>> csclprd3-6-1 slots=4 max-slots=4
>>>> csclprd3-6-5 slots=4 max-slots=4
>>>>
>>>> The number of slots also includes hyperthreading cores.
>>>>
>>>> One more question: would not having swap partitions defined on all the
>>>> nodes in the ring cause OpenMPI to crash? Because no swap partitions
>>>> are defined for any of the above systems.
>>>>
>>>> -Bill L.
>>>>
>>>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>>>> Sent: Wednesday, April 01, 2015 5:04 AM
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
>>>>
>>>> The warning about binding to memory is due to not having numactl-devel
>>>> installed on the system. The job would still run, but we are warning you
>>>> that we cannot bind memory to the same domain as the core where we bind
>>>> the process. That can cause poor performance, but it isn't fatal. I forget
>>>> the name of the param, but you can tell us to "shut up" :-)
>>>>
>>>> The other warning/error indicates that we aren't seeing enough cores in
>>>> the allocation you gave us via the hostfile to support one proc/core -
>>>> i.e., we didn't see at least 128 cores in the sum of the nodes you told
>>>> us about. I take it you were expecting that there were that many or more?
>>>>
>>>> Ralph
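A quick sanity check on that sum (a sketch; it simply totals the slots= values in the hostfile above with awk):

    # total the slots declared in the hostfile and compare against -np
    awk '{ for (i = 2; i <= NF; i++)
             if ($i ~ /^slots=/) { split($i, a, "="); total += a[2] }
         } END { print total, "slots declared" }' hostfile

For the hostfile above this reports 332 declared slots, comfortably more than 128 - but because those slot counts include hyperthreads, the number of physical cores hwloc reports on a given node can still be smaller than its slots= value, which is what triggers the "more processes than cpus" error under core binding.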
The commandline parameter I'm >>>> using to run the jobs is: >>>> >>>> $MPI_DIR/bin/mpirun -np $NSLOTS -bind-to-core -report-bindings --hostfile >>>> hostfile --mca btl_tcp_if_include eth0 --prefix $MPI_DIR >>>> $BENCH_DIR/$APP_DIR/$APP_BIN >>>> >>>> Where MPI_DIR=/hpc/apps/mpi/openmpi/1.8.2/ >>>> NSLOTS=128 >>>> >>>> I'm getting errors of the form and OpenMPI never runs the LAPACK benchmark: >>>> >>>> >>>> -------------------------------------------------------------------------- >>>> WARNING: a request was made to bind a process. While the system >>>> supports binding the process itself, at least one node does NOT >>>> support binding memory to the process location. >>>> >>>> Node: csclprd3-0-11 >>>> >>>> This usually is due to not having the required NUMA support installed >>>> on the node. In some Linux distributions, the required support is >>>> contained in the libnumactl and libnumactl-devel packages. >>>> This is a warning only; your job will continue, though performance may >>>> be degraded. >>>> >>>> -------------------------------------------------------------------------- >>>> >>>> >>>> -------------------------------------------------------------------------- >>>> A request was made to bind to that would result in binding more >>>> processes than cpus on a resource: >>>> >>>> Bind to: CORE >>>> Node: csclprd3-0-11 >>>> #processes: 2 >>>> #cpus: 1 >>>> >>>> You can override this protection by adding the "overload-allowed" >>>> option to your binding directive. >>>> >>>> -------------------------------------------------------------------------- >>>> >>>> The only installed numa packages are: >>>> numactl.x86_64 2.0.7-3.el6 >>>> @centos6.3-x86_64-0/$ >>>> >>>> When I search for the available NUMA packages I find: >>>> >>>> yum search numa | less >>>> >>>> Loaded plugins: fastestmirror >>>> Loading mirror speeds from cached hostfile >>>> ============================== N/S Matched: numa >>>> =============================== >>>> numactl-devel.i686 : Development package for building Applications >>>> that use numa >>>> numactl-devel.x86_64 : Development package for building >>>> Applications that use >>>> : numa >>>> numad.x86_64 : NUMA user daemon >>>> numactl.i686 : Library for tuning for Non Uniform Memory Access >>>> machines >>>> numactl.x86_64 : Library for tuning for Non Uniform Memory Access >>>> machines >>>> >>>> Do I need to install additional and/or different NUMA packages in order to >>>> get OpenMPI to work >>>> on this cluster? >>>> >>>> -Bill Lane >>>> IMPORTANT WARNING: This message is intended for the use of the person or >>>> entity to which it is addressed and may contain information that is >>>> privileged and confidential, the disclosure of which is governed by >>>> applicable law. If the reader of this message is not the intended >>>> recipient, or the employee or agent responsible for delivering it to the >>>> intended recipient, you are hereby notified that any dissemination, >>>> distribution or copying of this information is strictly prohibited. Thank >>>> you for your cooperation. 