> On Apr 8, 2015, at 10:20 AM, Lane, William <william.l...@cshs.org> wrote:
> 
> Ralph,
> 
> I just wanted to add that roughly a year ago I was fighting with these
> same issues, but was re-tasked to more pressing work and had to
> abandon looking into these OpenMPI 1.8.2 issues on our CentOS 6.3
> cluster.
> 
> In any case, in digging around I found you had the following
> recommendation back then:
> 
> > Argh - yeah, I got confused as things context switched a few too many 
> > times. The 1.8.2 release should certainly understand that arrangement, and 
> > --hetero-nodes. The only way it wouldn't see the latter is if you configure 
> > it --without-hwloc, or hwloc refused to build. 

I believe we fixed those issues.

> > 
> > Since there was a question about the numactl-devel requirement, I suspect 
> > that is the root cause of all evil in this case and the lack of 
> > --hetero-nodes would confirm that diagnosis :-) 
> 
> So the numactl-devel library is required for OpenMPI to function on NUMA
> nodes? Or maybe just NUMA nodes that also have hyperthreading capabilities?

Binding in general requires numactl-devel, whether on HT or non-HT nodes.
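
For reference, here is a minimal sketch of the usual fix on a CentOS 6.x node (the package name is the one from the yum output quoted below; the configure prefix is the install path mentioned later in this thread, so adjust to taste):

    # install the NUMA development headers so hwloc can build its memory-binding support
    yum install numactl-devel

    # then rebuild/reinstall Open MPI so the embedded hwloc picks that support up
    ./configure --prefix=/hpc/apps/mpi/openmpi/1.8.2
    make && make install

After that, the memory-binding warning quoted further down in this thread should go away.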

> 
> Bill L.
> 
> From: users [users-boun...@open-mpi.org] on behalf of Lane, William [william.l...@cshs.org]
> Sent: Wednesday, April 08, 2015 9:29 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
> 
> Ralph,
> 
> Thanks for YOUR help. I never would've managed to get the LAPACK
> benchmark running on more than one node in our cluster without it.
> 
> Ralph, is hyperthreading more of a curse
> than an advantage for HPC applications?
> 
> I'm going to go through all the OpenMPI 
> articles on hyperthreading and NUMA to
> see if that will shed any light on these
> issues.
> 
> -Bill L.
> 
> 
> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
> Sent: Tuesday, April 07, 2015 7:32 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
> 
> I’m not sure our man pages are good enough to answer your question, but here 
> is the URL
> 
> http://www.open-mpi.org/doc/v1.8/
> 
> I’m a tad tied up right now, but I’ll try to address this prior to the 1.8.5
> release. Thanks for all that debug effort! It helps a bunch.
> 
>> On Apr 7, 2015, at 1:17 PM, Lane, William <william.l...@cshs.org> wrote:
>> 
>> Ralph,
>> 
>> I've finally had some luck using the following:
>> $MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-single 
>> --mca btl_tcp_if_include eth0 --hetero-nodes --use-hwthread-cpus --prefix 
>> $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN
>> 
>> Where $NSLOTS was 56 and my hostfile hostfile-single is:
>> 
>> csclprd3-0-0 slots=12 max-slots=24
>> csclprd3-0-1 slots=6 max-slots=12
>> csclprd3-0-2 slots=6 max-slots=12
>> csclprd3-0-3 slots=6 max-slots=12
>> csclprd3-0-4 slots=6 max-slots=12
>> csclprd3-0-5 slots=6 max-slots=12
>> csclprd3-0-6 slots=6 max-slots=12
>> csclprd3-6-1 slots=4 max-slots=4
>> csclprd3-6-5 slots=4 max-slots=4
>> 
>> The max-slots values differ from the slots values on some nodes because I
>> include the hyperthreaded cores in max-slots; the last two nodes have CPUs
>> that don't support hyperthreading at all.
>> 
>> Does --use-hwthread-cpus prevent slots from
>> being assigned to hyperthreading cores?
>> 
>> For some reason the man pages for OpenMPI 1.8.2 aren't installed on our
>> CentOS 6.3 systems. Is there a URL where I can find a copy of the man pages
>> for OpenMPI 1.8.2?
>> 
>> Thanks for your help,
>> 
>> -Bill Lane
>> 
>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>> Sent: Monday, April 06, 2015 1:39 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
>> 
>> Hmmm…well, that shouldn’t be the issue. To check, try running it with
>> "--bind-to none". If you can get a backtrace telling us where it is crashing,
>> that would also help.
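>> 
>> A minimal sketch of one way to grab that backtrace, using standard Linux tooling rather than anything Open MPI-specific (the core file name depends on the kernel's core_pattern setting, so <corefile> below is just a placeholder):
>> 
>>     ulimit -c unlimited            # allow core dumps in the shell that launches the job
>>     # ...reproduce the crash with the same mpirun command...
>>     gdb $BENCH_DIR/$APP_DIR/$APP_BIN <corefile>
>>     (gdb) bt                       # print the stack at the point of the Bus error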
>> 
>> 
>>> On Apr 6, 2015, at 12:24 PM, Lane, William <william.l...@cshs.org> wrote:
>>> 
>>> Ralph,
>>> 
>>> For the following two different commandline invocations of the LAPACK 
>>> benchmark
>>> 
>>> $MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile 
>>> hostfile-no_slots --mca btl_tcp_if_include eth0 --hetero-nodes 
>>> --use-hwthread-cpus --bind-to hwthread --prefix $MPI_DIR 
>>> $BENCH_DIR/$APP_DIR/$APP_BIN
>>> 
>>> $MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile 
>>> hostfile-no_slots --mca btl_tcp_if_include eth0 --hetero-nodes 
>>> --bind-to-core --prefix $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN
>>> 
>>> I'm receiving the same kinds of OpenMPI error messages (but for different 
>>> nodes in the ring):
>>> 
>>>         [csclprd3-0-16:25940] *** Process received signal ***
>>>         [csclprd3-0-16:25940] Signal: Bus error (7)
>>>         [csclprd3-0-16:25940] Signal code: Non-existant physical address (2)
>>>         [csclprd3-0-16:25940] Failing at address: 0x7f8b1b5a2600
>>> 
>>>         --------------------------------------------------------------------------
>>>         mpirun noticed that process rank 82 with PID 25936 on node csclprd3-0-16 exited on signal 7 (Bus error).
>>>         --------------------------------------------------------------------------
>>>         16 total processes killed (some possibly by mpirun during cleanup)
>>> 
>>> It seems to occur on systems that have more than one physical CPU installed.
>>> Could this be due to the correct NUMA libraries not being installed?
>>> 
>>> -Bill L.
>>> 
>>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>>> Sent: Sunday, April 05, 2015 6:09 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
>>> 
>>> 
>>>> On Apr 5, 2015, at 5:58 PM, Lane, William <william.l...@cshs.org> wrote:
>>>> 
>>>> I think some of the Intel Blade systems in the cluster are
>>>> dual core, but don't support hyperthreading. Maybe it
>>>> would be better to exclude hyperthreading altogether
>>>> from submitted OpenMPI jobs?
>>> 
>>> Yes - or you can add "--hetero-nodes -use-hwthread-cpus --bind-to hwthread" 
>>> to the cmd line. This tells mpirun that the nodes aren't all the same, and 
>>> so it has to look at each node's topology instead of taking the first node 
>>> as the template for everything. The second tells it to use the HTs as 
>>> independent cpus where they are supported.
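>>> 
>>> If it helps, a quick sketch for checking which nodes actually expose hardware threads, using plain lscpu rather than anything Open MPI-specific:
>>> 
>>>     lscpu | grep -i 'thread(s) per core'    # 1 = no usable hyperthreads, 2 = HT enabled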
>>> 
>>> I'm not entirely sure the suggestion will work - if we hit a place where HT 
>>> isn't supported, we may balk at being asked to bind to HTs. I can probably 
>>> make a change that supports this kind of hetero arrangement (perhaps 
>>> something like bind-to pu) - might make it into 1.8.5 (we are just starting 
>>> the release process on it now).
>>> 
>>>> 
>>>> OpenMPI doesn't crash, but it doesn't run the LAPACK
>>>> benchmark either.
>>>> 
>>>> Thanks again Ralph.
>>>> 
>>>> Bill L.
>>>> 
>>>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>>>> Sent: Wednesday, April 01, 2015 8:40 AM
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
>>>> 
>>>> Bingo - you said the magic word. This is a terminology issue. When we say 
>>>> "core", we mean the old definition of "core", not "hyperthreads". If you 
>>>> want to use HTs as your base processing unit and bind to them, then you 
>>>> need to specify --bind-to hwthread. That warning should then go away.
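>>>> 
>>>> As an illustrative sketch, that just means swapping the binding option in the command line quoted below (everything else unchanged):
>>>> 
>>>>     $MPI_DIR/bin/mpirun -np $NSLOTS --bind-to hwthread --report-bindings --hostfile hostfile --mca btl_tcp_if_include eth0 --prefix $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN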
>>>> 
>>>> We don't require a swap region be mounted - I didn't see anything in your 
>>>> original message indicating that OMPI had actually crashed, but just 
>>>> wasn't launching due to the above issue. Were you actually seeing crashes 
>>>> as well?
>>>> 
>>>> 
>>>> On Wed, Apr 1, 2015 at 8:31 AM, Lane, William <william.l...@cshs.org> wrote:
>>>> Ralph,
>>>> 
>>>> Here's the associated hostfile:
>>>> 
>>>> #openMPI hostfile for csclprd3
>>>> #max slots prevents oversubscribing csclprd3-0-9
>>>> csclprd3-0-0 slots=12 max-slots=12
>>>> csclprd3-0-1 slots=6 max-slots=6
>>>> csclprd3-0-2 slots=6 max-slots=6
>>>> csclprd3-0-3 slots=6 max-slots=6
>>>> csclprd3-0-4 slots=6 max-slots=6
>>>> csclprd3-0-5 slots=6 max-slots=6
>>>> csclprd3-0-6 slots=6 max-slots=6
>>>> csclprd3-0-7 slots=32 max-slots=32
>>>> csclprd3-0-8 slots=32 max-slots=32
>>>> csclprd3-0-9 slots=32 max-slots=32
>>>> csclprd3-0-10 slots=32 max-slots=32
>>>> csclprd3-0-11 slots=32 max-slots=32
>>>> csclprd3-0-12 slots=12 max-slots=12
>>>> csclprd3-0-13 slots=24 max-slots=24
>>>> csclprd3-0-14 slots=16 max-slots=16
>>>> csclprd3-0-15 slots=16 max-slots=16
>>>> csclprd3-0-16 slots=24 max-slots=24
>>>> csclprd3-0-17 slots=24 max-slots=24
>>>> csclprd3-6-1 slots=4 max-slots=4
>>>> csclprd3-6-5 slots=4 max-slots=4
>>>> 
>>>> The number of slots also includes hyperthreading
>>>> cores.
>>>> 
>>>> One more question: would not having swap partitions defined on all the
>>>> nodes in the ring cause OpenMPI to crash? No swap partitions are defined
>>>> for any of the above systems.
>>>> 
>>>> -Bill L.
>>>> 
>>>> 
>>>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>>>> Sent: Wednesday, April 01, 2015 5:04 AM
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
>>>> 
>>>> The warning about binding to memory is due to not having numactl-devel
>>>> installed on the system. The job would still run, but we are warning you
>>>> that we cannot bind memory to the same domain as the core where we bind
>>>> the process. That can cause poor performance, but it isn't fatal. I forget
>>>> the name of the param, but you can tell us to "shut up" :-)
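>>>> 
>>>> If you want to hunt for it, one sketch (assuming the ompi_info that ships with the same 1.8.2 install) is to dump the hwloc-related MCA parameters, which include the memory-binding ones and their accepted values:
>>>> 
>>>>     $MPI_DIR/bin/ompi_info --param hwloc all --level 9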
>>>> 
>>>> The other warning/error indicates that we aren't seeing enough cores in
>>>> the allocation you gave us via the hostfile to support one proc/core -
>>>> i.e., we didn't see at least 128 cores in the sum of the nodes you told us
>>>> about. I take it you were expecting that there were that many or more?
>>>> 
>>>> Ralph
>>>> 
>>>> 
>>>> On Wed, Apr 1, 2015 at 12:54 AM, Lane, William <william.l...@cshs.org> wrote:
>>>> I'm having problems running OpenMPI jobs
>>>> (using a hostfile) on an HPC cluster running
>>>> ROCKS on CentOS 6.3. I'm running OpenMPI
>>>> outside of Sun Grid Engine (i.e. it is not submitted
>>>> as a job to SGE). The program being run is a LAPACK
>>>> benchmark. The command line I'm using to run the jobs is:
>>>> 
>>>> $MPI_DIR/bin/mpirun -np $NSLOTS -bind-to-core -report-bindings --hostfile 
>>>> hostfile --mca btl_tcp_if_include eth0 --prefix $MPI_DIR 
>>>> $BENCH_DIR/$APP_DIR/$APP_BIN
>>>> 
>>>> Where MPI_DIR=/hpc/apps/mpi/openmpi/1.8.2/
>>>> NSLOTS=128
>>>> 
>>>> I'm getting errors of the following form, and OpenMPI never runs the LAPACK benchmark:
>>>> 
>>>>    --------------------------------------------------------------------------
>>>>    WARNING: a request was made to bind a process. While the system
>>>>    supports binding the process itself, at least one node does NOT
>>>>    support binding memory to the process location.
>>>> 
>>>>     Node:  csclprd3-0-11
>>>> 
>>>>    This usually is due to not having the required NUMA support installed
>>>>    on the node. In some Linux distributions, the required support is
>>>>    contained in the libnumactl and libnumactl-devel packages.
>>>>    This is a warning only; your job will continue, though performance may be degraded.
>>>>    --------------------------------------------------------------------------
>>>> 
>>>>    --------------------------------------------------------------------------
>>>>    A request was made to bind to that would result in binding more
>>>>    processes than cpus on a resource:
>>>> 
>>>>       Bind to:     CORE
>>>>       Node:        csclprd3-0-11
>>>>       #processes:  2
>>>>       #cpus:       1
>>>> 
>>>>    You can override this protection by adding the "overload-allowed"
>>>>    option to your binding directive.
>>>>    --------------------------------------------------------------------------
>>>> 
>>>> The only installed NUMA package is:
>>>> 
>>>>         numactl.x86_64    2.0.7-3.el6    @centos6.3-x86_64-0/$
>>>> 
>>>> When I search for the available NUMA packages I find:
>>>> 
>>>> yum search numa | less
>>>> 
>>>>         Loaded plugins: fastestmirror
>>>>         Loading mirror speeds from cached hostfile
>>>>         ============================== N/S Matched: numa ===============================
>>>>         numactl-devel.i686 : Development package for building Applications that use numa
>>>>         numactl-devel.x86_64 : Development package for building Applications that use numa
>>>>         numad.x86_64 : NUMA user daemon
>>>>         numactl.i686 : Library for tuning for Non Uniform Memory Access machines
>>>>         numactl.x86_64 : Library for tuning for Non Uniform Memory Access machines
>>>> 
>>>> Do I need to install additional and/or different NUMA packages in order to 
>>>> get OpenMPI to work
>>>> on this cluster?
>>>> 
>>>> -Bill Lane
