Re: [OMPI users] 1.8.3 executable with 1.8.4 mpirun/orted?

2015-04-08 Thread Ralph Castain
Hmmm…yeah, we’ve been discussing this point. It’s a bit of a mixed bag. We hit 
problems where people don’t get their paths set correctly on remote machines, 
and then we hang because of bad connections between incompatible versions. At the 
same time, we do hit situations like this one.

We’re getting ready to release 1.8.5 - let me discuss with the team about what 
we can/should do to resolve these problems.


> On Apr 7, 2015, at 8:43 PM, Alan Wild  wrote:
> 
> I know this isn't "recommended", but a vendor recently gave me an executable 
> compiled with openmpi-1.8.3, and I happened to have recently completed a build of 
> 1.8.4 (but didn't have 1.8.3 sitting around, and the vendor refuses to provide 
> his build).
> 
> Since these releases are so close, they should be ABI compatible, so I thought 
> I would see what happens...
> 
> [arwild1@hplcslsp2 ~]$ mpirun -n 2 -H localhost vendor_app_mpi
> [hplcslsp2:11394] [[56032,0],0] tcp_peer_recv_connect_ack: received different 
> version from [[56032,1],0]: 1.8.3 instead of 1.8.4
> [hplcslsp2:11394] [[56032,0],0] tcp_peer_recv_connect_ack: received different 
> version from [[56032,1],1]: 1.8.3 instead of 1.8.4
> 
> and then everything hangs.  I can clearly see the output coming from 
> 
> ./orte/mca/oob/tcp/oob_tcp_connection.c
> 
> and where it returns
> 
> return ORTE_ERR_CONNECTION_REFUSED;
> 
> 
> So it looks like I'm going to have to at least build 1.8.3, but is there any 
> way to work around this given we are dealing with builds that are that close? 
>  I'm really not interested in "rolling back" to 1.8.3 or providing both 
> releases on my system.  
> 
> (yes, "right answer" is to get the vendor to provide his build... long stoy)
> 
> -Alan
> 
> 
> 
> -- 
> a...@madllama.net  http://humbleville.blogspot.com 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/04/26645.php



[OMPI users] parsability of ompi_info --parsable output

2015-04-08 Thread Lev Givon
The output of ompi_info --parsable is somewhat difficult to parse
programmatically because it doesn't escape or quote fields that contain colons,
e.g.,

build:timestamp:Tue Dec 23 15:47:28 EST 2014
option:threads:posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes, OMPI 
progress: no, ORTE progress: yes, Event lib: yes)

Is there some way to facilitate machine parsing of the output of ompi_info
without having to special-case those options/parameters whose data fields might
contain colons? If not, it would be nice to quote such fields in
future releases of ompi_info.
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/



Re: [OMPI users] OpenMPI 1.8.4 - Java Library - allToAllv()

2015-04-08 Thread Ralph Castain
In the interim, perhaps another way of addressing this would be to ask: what 
happens when you run your reproducer with MPICH? Does that work?

This would at least tell us how another implementation interpreted that 
function.


> On Apr 7, 2015, at 10:18 AM, Ralph Castain  wrote:
> 
> I’m afraid we’ll have to get someone from the Forum to interpret (Howard is a 
> member as well), but here is what I see just below that, in the description 
> section:
> 
> The type signature associated with sendcounts[j], sendtype at process i must 
> be equal to the type signature associated with recvcounts[i], recvtype at 
> process j. This implies that the amount of data sent must be equal to the 
> amount of data received, pairwise between every pair of processes
> 
> 
>> On Apr 7, 2015, at 9:56 AM, Hamidreza Anvari  wrote:
>> 
>> Hello,
>> 
>> Thanks for your description.
>> I'm currently doing allToAll() prior to allToAllV(), to communicate length 
>> of expected messages.
>> .
>> BUT, I still strongly believe that the right implementation of this method 
>> is something that I expected earlier!
>> If you check the MPI specification here:
>> 
>> http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf 
>> 
>> Page 170
>> Line 14
>> 
>> It is mentioned that "... the number of elements that CAN be received...". 
>> which implies that the actual received message may have shorter length.
>> 
>> While in cases where it is mandatory to have same value, the modal "MUST" is 
>> used. for example at page 171 Line 1, it is mentioned that "... sendtype at 
>> process i MUST be equal to the type signature ...".
>> 
>> SO, I would expect that any consistent implementation of MPI specification 
>> handle this message length matching by itself, as I asked originally.
>> 
>> Thanks,
>> -- HR
>> 
>> On Tue, Apr 7, 2015 at 6:03 AM, Howard Pritchard  wrote:
>> Hi HR,
>> 
>> Sorry for not noticing the receive side earlier, but as Ralph implied earlier
>> in this thread, the MPI standard has more strict type matching for 
>> collectives
>> than for point to point.  Namely, the number of bytes the receiver expects
>> to receive from a given sender in the alltoallv must match the number of 
>> bytes
>> sent by the sender.
>> 
>> You were just getting lucky with the older Open MPI.  The error message
>> isn't so great though.  It's likely that in the newer Open MPI you are using a
>> collective algorithm for alltoallv that assumes your app is obeying the
>> standard.
>> 
>> You are correct that if the ranks don't know how much data will be sent
>> to them from each rank prior to the alltoallv op, you will need to have some
>> mechanism for exchanging this info prior to the alltoallv op.
>> 
>> Howard
>> 
>> 
>> 2015-04-06 23:23 GMT-06:00 Hamidreza Anvari :
>> Hello,
>> 
>> If I set the size2 values according to your suggestion, which are the same 
>> values as on the sending nodes, it works fine.
>> But by definition it does not need to be exactly the same as the length of 
>> the sent data; it is just a maximum length of expected data to receive. If 
>> not, it is inevitable to run an allToAll() first to communicate the data 
>> sizes and then do the main allToAllV(), which is an expensive and 
>> unnecessary communication overhead.
>> 
>> I just created a reproducer in C++ which gives the error under OpenMPI 
>> 1.8.4, but runs correctly under OpenMPI 1.5.4.
>> (I've not included the Java version of this reproducer, which I think is not 
>> important, as the current version is enough to reproduce the error. In any 
>> case, it is straightforward to convert this code to Java.)
>> 
>> Thanks,
>> -- HR
>> 
>> On Mon, Apr 6, 2015 at 3:03 PM, Ralph Castain  wrote:
>> That would imply that the issue is in the underlying C implementation in 
>> OMPI, not the Java bindings. The reproducer would definitely help pin it 
>> down.
>> 
>> If you change the size2 values to the ones we sent you, does the program by 
>> chance work?
>> 
>> 
>>> On Apr 6, 2015, at 1:44 PM, Hamidreza Anvari  wrote:
>>> 
>>> I'll try that as well.
>>> Meanwhile, I found that my c++ code is running fine on a machine running 
>>> OpenMPI 1.5.4, but I receive the same error under OpenMPI 1.8.4 for both 
>>> Java and C++.
>>> 
>>> On Mon, Apr 6, 2015 at 2:21 PM, Howard Pritchard  wrote:
>>> Hello HR,
>>> 
>>> Thanks!  If you have Java 1.7 installed on your system would you mind 
>>> trying to test against that version too?
>>> 
>>> Thanks,
>>> 
>>> Howard
>>> 
>>> 
>>> 2015-04-06 13:09 GMT-06:00 Hamidreza Anvari :
>>> Hello,
>>> 
>>> 1. I'm using Java/Javac version 1.8.0_20 under OS X 10.10.2.
>>> 
>>> 2. I have used the following configuration for making OpenMPI:
>>> ./configure --enable-mpi-java 
>>> --with-jdk-bindir="

Re: [OMPI users] OpenMPI 1.8.4 - Java Library - allToAllv()

2015-04-08 Thread Edgar Gabriel
I think the following paragraph might be useful. It's in MPI-3, page 142, 
lines 16-20:


"The type-matching conditions for the collective operations are more 
strict than the corresponding conditions between sender and receiver in 
point-to-point. Namely, for collective operations, the amount of data 
sent must exactly match the amount of data specified by the receiver. 
Different type maps (the layout in memory, see Section 4.1) between 
sender and receiver are still allowed".



Thanks
Edgar
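
To make the requirement above concrete, here is a minimal C++ sketch of the
pattern discussed in this thread: exchange the per-peer counts with
MPI_Alltoall first, then size recvcounts to match the senders' sendcounts
exactly before calling MPI_Alltoallv. The payload sizes and buffer contents
are hypothetical; this is an illustration, not the poster's reproducer.

// Minimal sketch (hypothetical data): exchange per-peer counts first, then call
// MPI_Alltoallv with recvcounts that exactly match the senders' sendcounts.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Hypothetical variable-length payload destined for each rank.
    std::vector<int> sendcounts(size), sdispls(size, 0);
    for (int i = 0; i < size; ++i) sendcounts[i] = 1 + (rank + i) % 3;
    for (int i = 1; i < size; ++i) sdispls[i] = sdispls[i-1] + sendcounts[i-1];
    std::vector<double> sendbuf(sdispls[size-1] + sendcounts[size-1], rank);

    // Step 1: tell every peer how many elements it will receive from us.
    std::vector<int> recvcounts(size);
    MPI_Alltoall(sendcounts.data(), 1, MPI_INT,
                 recvcounts.data(), 1, MPI_INT, MPI_COMM_WORLD);

    // Step 2: size the receive side to match exactly, then do the alltoallv.
    std::vector<int> rdispls(size, 0);
    for (int i = 1; i < size; ++i) rdispls[i] = rdispls[i-1] + recvcounts[i-1];
    std::vector<double> recvbuf(rdispls[size-1] + recvcounts[size-1]);

    MPI_Alltoallv(sendbuf.data(), sendcounts.data(), sdispls.data(), MPI_DOUBLE,
                  recvbuf.data(), recvcounts.data(), rdispls.data(), MPI_DOUBLE,
                  MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

The extra MPI_Alltoall is the overhead HR objects to below, but it is what the
type-matching rule quoted above effectively requires when the receiver does not
already know the incoming counts.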

On 4/8/2015 9:30 AM, Ralph Castain wrote:

In the interim, perhaps another way of addressing this would be to ask:
what happens when you run your reproducer with MPICH? Does that work?

This would at least tell us how another implementation interpreted that
function.



On Apr 7, 2015, at 10:18 AM, Ralph Castain  wrote:

I’m afraid we’ll have to get someone from the Forum to interpret
(Howard is a member as well), but here is what I see just below that,
in the description section:

/The type signature associated with sendcounts[j], sendtype at
process i must be equal to the type signature associated
with recvcounts[i], recvtype at process j. This implies that the
amount of data sent must be equal to the amount of data received,
pairwise between every pair of processes/



On Apr 7, 2015, at 9:56 AM, Hamidreza Anvari  wrote:

Hello,

Thanks for your description.
I'm currently doing allToAll() prior to allToAllV(), to communicate
length of expected messages.
.
BUT, I still strongly believe that the right implementation of this
method is something that I expected earlier!
If you check the MPI specification here:

http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
Page 170
Line 14

It is mentioned that "... the number of elements that CAN be
received...". which implies that the actual received message may have
shorter length.

While in cases where it is mandatory to have same value, the modal
"MUST" is used. for example at page 171 Line 1, it is mentioned that
"... sendtype at process i MUST be equal to the type signature ...".

SO, I would expect that any consistent implementation of MPI
specification handle this message length matching by itself, as I
asked originally.

Thanks,
-- HR

On Tue, Apr 7, 2015 at 6:03 AM, Howard Pritchard  wrote:

Hi HR,

Sorry for not noticing the receive side earlier, but as Ralph
implied earlier
in this thread, the MPI standard has more strict type matching
for collectives
than for point to point.  Namely, the number of bytes the
receiver expects
to receive from a given sender in the alltoallv must match the
number of bytes
sent by the sender.

You were just getting lucky with the older Open MPI.  The error message
isn't so great though.  It's likely that in the newer Open MPI you are using a
collective algorithm for alltoallv that assumes your app is obeying the
standard.

You are correct that if the ranks don't know how much data will
be sent
to them from each rank prior to the alltoallv op, you will need
to have some
mechanism for exchanging this info prior to the alltoallv op.

Howard


2015-04-06 23:23 GMT-06:00 Hamidreza Anvari :

Hello,

If I set the size2 values according to your suggestion, which
are the same values as on the sending nodes, it works fine.
But by definition it does not need to be exactly the same as
the length of the sent data; it is just a maximum length of
expected data to receive. If not, it is inevitable to run an
allToAll() first to communicate the data sizes and then
do the main allToAllV(), which is an expensive and
unnecessary communication overhead.

I just created a reproducer in C++ which gives the error
under OpenMPI 1.8.4, but runs correctly under OpenMPI 1.5.4.
(I've not included the Java version of this reproducer, which
I think is not important, as the current version is enough to
reproduce the error. In any case, it is straightforward to
convert this code to Java.)

Thanks,
-- HR

On Mon, Apr 6, 2015 at 3:03 PM, Ralph Castain  wrote:

That would imply that the issue is in the underlying C
implementation in OMPI, not the Java bindings. The
reproducer would definitely help pin it down.

If you change the size2 values to the ones we sent you,
does the program by chance work?



On Apr 6, 2015, at 1:44 PM, Hamidreza Anvari  wrote:

I'll try that as well.
Meanwhile, I found that my c++ code is running fine on a
machine running OpenMPI 1.5.4, but I receive the same
error under OpenMPI 1.8.4 for both Java and C++.

On

Re: [OMPI users] 1.8.3 executable with 1.8.4 mpirun/orted?

2015-04-08 Thread Ralph Castain
Meantime, I’ve created a patch that should address this problem:

https://github.com/open-mpi/ompi-release/pull/227 


If you can and would like, please see if it resolves this for you.


> On Apr 7, 2015, at 9:29 PM, Ralph Castain  wrote:
> 
> Hmmm…yeah, we’ve been discussing this point. It’s a bit of a mixed bag. We 
> hit problems where people don’t get their paths set correctly on remote 
> machines, and then we hang because of bad connections between incompatible 
> versions. At the same time, we do hit situations like this one.
> 
> We’re getting ready to release 1.8.5 - let me discuss with the team about 
> what we can/should do to resolve these problems.
> 
> 
>> On Apr 7, 2015, at 8:43 PM, Alan Wild  wrote:
>> 
>> I know this isn't "recommended", but a vendor recently gave me an executable 
>> compiled with openmpi-1.8.3, and I happened to have recently completed a build of 
>> 1.8.4 (but didn't have 1.8.3 sitting around, and the vendor refuses to 
>> provide his build).
>> 
>> Since these releases are so close, they should be ABI compatible, so I thought 
>> I would see what happens...
>> 
>> [arwild1@hplcslsp2 ~]$ mpirun -n 2 -H localhost vendor_app_mpi
>> [hplcslsp2:11394] [[56032,0],0] tcp_peer_recv_connect_ack: received 
>> different version from [[56032,1],0]: 1.8.3 instead of 1.8.4
>> [hplcslsp2:11394] [[56032,0],0] tcp_peer_recv_connect_ack: received 
>> different version from [[56032,1],1]: 1.8.3 instead of 1.8.4
>> 
>> and then everything hangs.  I can clearly see the output coming from 
>> 
>> ./orte/mca/oob/tcp/oob_tcp_connection.c
>> 
>> and where it returns
>> 
>> return ORTE_ERR_CONNECTION_REFUSED;
>> 
>> 
>> So it looks like I'm going to have to at least build 1.8.3, but is there any 
>> way to work around this given we are dealing with builds that are that 
>> close?  I'm really not interested in "rolling back" to 1.8.3 or providing 
>> both releases on my system.  
>> 
>> (yes, "right answer" is to get the vendor to provide his build... long stoy)
>> 
>> -Alan
>> 
>> 
>> 
>> -- 
>> a...@madllama.net  http://humbleville.blogspot.com 
>> ___
>> users mailing list
>> us...@open-mpi.org 
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2015/04/26645.php
> 



Re: [OMPI users] parsability of ompi_info --parsable output

2015-04-08 Thread Ralph Castain
I think the assumption was that people would parse this as follows:

* entry before the first colon is the category

* entry between first and second colons is the subcategory

* everything past the second colon is the value

You are right, however, that the current format precludes the use of an 
automatic tokenizer looking for colon. I don’t think quoting the value field 
would really solve that problem - do you have any suggestions?
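
As an illustration of that convention, a small C++ filter (not part of Open
MPI; the program and names here are made up) could split each line at the
first two colons only, so any colons inside the value are preserved:

// Illustrative only: split each ompi_info --parsable line at the first two
// colons; everything after the second colon (which may itself contain colons)
// is treated as the value.
#include <iostream>
#include <string>

int main() {
    std::string line;
    while (std::getline(std::cin, line)) {
        std::string::size_type c1 = line.find(':');
        if (c1 == std::string::npos) continue;
        std::string::size_type c2 = line.find(':', c1 + 1);
        if (c2 == std::string::npos) continue;
        std::cout << "category="    << line.substr(0, c1)
                  << " subcategory=" << line.substr(c1 + 1, c2 - c1 - 1)
                  << " value="       << line.substr(c2 + 1) << "\n";
    }
    return 0;
}

Used as, e.g., "ompi_info --parsable | ./split3" (hypothetical name). The
remaining difficulty, as Lev notes in his follow-up, is that the value itself
may still need further parsing.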


> On Apr 8, 2015, at 7:23 AM, Lev Givon  wrote:
> 
> The output of ompi_info --parsable is somewhat difficult to parse
> programmatically because it doesn't escape or quote fields that contain 
> colons,
> e.g.,
> 
> build:timestamp:Tue Dec 23 15:47:28 EST 2014
> option:threads:posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes, OMPI 
> progress: no, ORTE progress: yes, Event lib: yes)
> 
> Is there some way to facilitate machine parsing of the output of ompi_info
> without having to special-case those options/parameters whose data fields 
> might
> contain colons ? If not, it would be nice to quote such fields in
> future releases of ompi_info.
> -- 
> Lev Givon
> Bionet Group | Neurokernel Project
> http://www.columbia.edu/~lev/
> http://lebedov.github.io/
> http://neurokernel.github.io/
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/04/26647.php



Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-08 Thread Ralph Castain
Hmmm…could you try 1.8.5rc1? We’ve done some thread-related work on it, but we 
may not have solved this level of use just yet. We are working on the new 1.9 
series, which we hope to make more thread-friendly:

http://www.open-mpi.org/software/ompi/v1.8/ 



> On Apr 7, 2015, at 11:16 AM, Thomas Klimpel  wrote:
> 
> Here is a stackdump from inside the debugger (because it gives filenames and 
> line numbers):
> 
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7f1eb6bfd700 (LWP 24847)]
> 0x00366aa79252 in _int_malloc () from /lib64/libc.so.6
> (gdb) bt
> #0  0x00366aa79252 in _int_malloc () from /lib64/libc.so.6
> #1  0x00366aa7b7da in _int_realloc () from /lib64/libc.so.6
> #2  0x00366aa7baf5 in realloc () from /lib64/libc.so.6
> #3  0x7f1ee005d0a8 in epoll_dispatch (base=, 
> arg=0x13d1310, tv=)
> at ../../../../../package/openmpi-1.6.5/opal/event/epoll.c:271
> #4  0x7f1ee005f1cf in opal_event_base_loop (base=0x13d1e50, flags=<value optimized out>)
> at ../../../../../package/openmpi-1.6.5/opal/event/event.c:838
> #5  0x7f1ee00842f9 in opal_progress () at 
> ../../../../package/openmpi-1.6.5/opal/runtime/opal_progress.c:189
> #6  0x7f1ecd43cd7f in mca_pml_ob1_iprobe (src=, 
> tag=-1, comm=0x164dd40, matched=0x7f1eb6bfb8ac, status=0x7f1eb6bfb8b0)
> at 
> ../../../../../../../package/openmpi-1.6.5/ompi/mca/pml/ob1/pml_ob1_iprobe.c:48
> #7  0x7f1edffe3427 in PMPI_Iprobe (source=227, tag=-1, comm=0x164dd40, 
> flag=, status=)
> at piprobe.c:79
> #8  0x7f1eebb518e7 in OMPIConnection::Receive (this=0x13c7950, 
> rMessage_p=std::vector of length 0, capacity 0, 
> rMessageId_p=@0x7f1eb6bfc26c, NodeId_p=227)
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/04/26642.php



Re: [OMPI users] parsability of ompi_info --parsable output

2015-04-08 Thread Lev Givon
Received from Ralph Castain on Wed, Apr 08, 2015 at 10:46:58AM EDT:
>
> > On Apr 8, 2015, at 7:23 AM, Lev Givon  wrote:
> > 
> > The output of ompi_info --parsable is somewhat difficult to parse
> > programmatically because it doesn't escape or quote fields that contain 
> > colons,
> > e.g.,
> > 
> > build:timestamp:Tue Dec 23 15:47:28 EST 2014
> > option:threads:posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes, OMPI 
> > progress: no, ORTE progress: yes, Event lib: yes)
> > 
> > Is there some way to facilitate machine parsing of the output of ompi_info
> > without having to special-case those options/parameters whose data fields 
> > might
> > contain colons ? If not, it would be nice to quote such fields in
> > future releases of ompi_info.
>
> I think the assumption was that people would parse this as follows:
> 
> * entry before the first colon is the category
> 
> * entry between first and second colons is the subcategory
> 
> * everything past the second colon is the value

Given that the "value" as defined above may still contain colons, it's still
necessary to process it to extract the various data in it, e.g., the various MCA
parameters, their values, types, etc.

> You are right, however, that the current format precludes the use of an
> automatic tokenizer looking for colon. I don't think quoting the value field
> would really solve that problem - do you have any suggestions?

Why wouldn't quoting the value field address the parsing problem? Quoting a
field that contains colons would effectively permit the output of ompi_info
--parsable to be processed just like a CSV file; most CSV readers seem to
support inclusion of the separator character in data fields via quoting.
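
For what it's worth, a sketch of the kind of tokenizer that quoting would
enable (this is not the current ompi_info output format; the quoted line below
is a hypothetical example of the proposed form):

// Illustrative only: split a colon-separated line while honoring double-quoted
// fields, the way a CSV reader handles a quoted separator character.
#include <iostream>
#include <string>
#include <vector>

std::vector<std::string> split_quoted(const std::string& line, char sep = ':') {
    std::vector<std::string> fields;
    std::string cur;
    bool in_quotes = false;
    for (char ch : line) {
        if (ch == '"')                     in_quotes = !in_quotes;  // toggle quoted state
        else if (ch == sep && !in_quotes) { fields.push_back(cur); cur.clear(); }
        else                               cur += ch;
    }
    fields.push_back(cur);
    return fields;
}

int main() {
    // Hypothetical quoted form of the timestamp line from the original post.
    std::string line = "build:timestamp:\"Tue Dec 23 15:47:28 EST 2014\"";
    for (const std::string& f : split_quoted(line))
        std::cout << f << "\n";   // prints: build / timestamp / Tue Dec 23 15:47:28 EST 2014
    return 0;
}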
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/



Re: [OMPI users] parsability of ompi_info --parsable output

2015-04-08 Thread Ralph Castain
Sounds reasonable - I don’t have time to work thru it right now, but we can 
look at it once Jeff returns as he wrote all that stuff and might see where to 
make the changes more readily than me.

> On Apr 8, 2015, at 8:43 AM, Lev Givon  wrote:
> 
> Received from Ralph Castain on Wed, Apr 08, 2015 at 10:46:58AM EDT:
>> 
>>> On Apr 8, 2015, at 7:23 AM, Lev Givon  wrote:
>>> 
>>> The output of ompi_info --parsable is somewhat difficult to parse
>>> programmatically because it doesn't escape or quote fields that contain 
>>> colons,
>>> e.g.,
>>> 
>>> build:timestamp:Tue Dec 23 15:47:28 EST 2014
>>> option:threads:posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes, OMPI 
>>> progress: no, ORTE progress: yes, Event lib: yes)
>>> 
>>> Is there some way to facilitate machine parsing of the output of ompi_info
>>> without having to special-case those options/parameters whose data fields 
>>> might
>>> contain colons ? If not, it would be nice to quote such fields in
>>> future releases of ompi_info.
>> 
>> I think the assumption was that people would parse this as follows:
>> 
>> * entry before the first colon is the category
>> 
>> * entry between first and second colons is the subcategory
>> 
>> * everything past the second colon is the value
> 
> Given that the "value" as defined above may still contain colons, it's still
> necessary to process it to extract the various data in it, e.g., the various 
> MCA
> parameters, their values, types, etc.
> 
>> You are right, however, that the current format precludes the use of an
>> automatic tokenizer looking for colon. I don't think quoting the value field
>> would really solve that problem - do you have any suggestions?
> 
> Why wouldn't quoting the value field address the parsing problem? Quoting a
> field that contains colons would effectively permit the output of ompi_info
> --parsable to be processed just like a CSV file; most CSV readers seem to
> support inclusion of the separator character in data fields via quoting.
> -- 
> Lev Givon
> Bionet Group | Neurokernel Project
> http://www.columbia.edu/~lev/
> http://lebedov.github.io/
> http://neurokernel.github.io/
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/04/26653.php



Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3

2015-04-08 Thread Lane, William
Ralph,

Thanks for YOUR help,  I never
would've managed to get the LAPACK
benchmark running on more than one
node in our cluster without your help.

Ralph, is hyperthreading more of a curse
than an advantage for HPC applications?

I'm going to go through all the OpenMPI
articles on hyperthreading and NUMA to
see if that will shed any light on these
issues.

-Bill L.



From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain 
[r...@open-mpi.org]
Sent: Tuesday, April 07, 2015 7:32 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3

I’m not sure our man pages are good enough to answer your question, but here is 
the URL

http://www.open-mpi.org/doc/v1.8/

I’m a tad tied up right now, but I’ll try to address this prior to 1.8.5 
release. Thanks for all that debug effort! Helps a bunch.

On Apr 7, 2015, at 1:17 PM, Lane, William  wrote:

Ralph,

I've finally had some luck using the following:
$MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-single 
--mca btl_tcp_if_include eth0 --hetero-nodes --use-hwthread-cpus --prefix 
$MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN

Where $NSLOTS was 56 and my hostfile hostfile-single is:

csclprd3-0-0 slots=12 max-slots=24
csclprd3-0-1 slots=6 max-slots=12
csclprd3-0-2 slots=6 max-slots=12
csclprd3-0-3 slots=6 max-slots=12
csclprd3-0-4 slots=6 max-slots=12
csclprd3-0-5 slots=6 max-slots=12
csclprd3-0-6 slots=6 max-slots=12
csclprd3-6-1 slots=4 max-slots=4
csclprd3-6-5 slots=4 max-slots=4

The max-slots differs from slots on some nodes
because I include the hyperthreaded cores in
the max-slots, the last two nodes have CPU's that
don't support hyperthreading at all.

Does --use-hwthread-cpus prevent slots from
being assigned to hyperthreading cores?

For some reason the man page for OpenMPI 1.8.2
isn't installed on our CentOS 6.3 systems. Is there a
URL where I can find a copy of the man pages for OpenMPI 1.8.2?

Thanks for your help,

-Bill Lane


From: users [users-boun...@open-mpi.org] on 
behalf of Ralph Castain [r...@open-mpi.org]
Sent: Monday, April 06, 2015 1:39 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3

Hmmm…well, that shouldn’t be the issue. To check, try running it with “bind-to 
none”. If you can get a backtrace telling us where it is crashing, that would 
also help.


On Apr 6, 2015, at 12:24 PM, Lane, William  wrote:

Ralph,

For the following two different commandline invocations of the LAPACK benchmark

$MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-no_slots 
--mca btl_tcp_if_include eth0 --hetero-nodes --use-hwthread-cpus --bind-to 
hwthread --prefix $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN

$MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-no_slots 
--mca btl_tcp_if_include eth0 --hetero-nodes --bind-to-core --prefix $MPI_DIR 
$BENCH_DIR/$APP_DIR/$APP_BIN

I'm receiving the same kinds of OpenMPI error messages (but for different nodes 
in the ring):

[csclprd3-0-16:25940] *** Process received signal ***
[csclprd3-0-16:25940] Signal: Bus error (7)
[csclprd3-0-16:25940] Signal code: Non-existant physical address (2)
[csclprd3-0-16:25940] Failing at address: 0x7f8b1b5a2600


--
mpirun noticed that process rank 82 with PID 25936 on node 
csclprd3-0-16 exited on signal 7 (Bus error).

--
16 total processes killed (some possibly by mpirun during cleanup)

It seems to occur on systems that have more than one physical CPU installed. 
Could this be due to a lack of the correct NUMA libraries being installed?

-Bill L.


From: users [users-boun...@open-mpi.org] on 
behalf of Ralph Castain [r...@open-mpi.org]
Sent: Sunday, April 05, 2015 6:09 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3


On Apr 5, 2015, at 5:58 PM, Lane, William  wrote:

I think some of the Intel Blade systems in the cluster are
dual core, but don't support hyperthreading. Maybe it
would be better to exclude hyperthreading altogether
from submitted OpenMPI jobs?

Yes - or you can add "--hetero-nodes -use-hwthread-cpus --bind-to hwthread" to 
the cmd line. This tells mpirun that the nodes aren't all the same, and so it 
has to look at each node's topology instead of taking the first node as the 
template for everything. The second tells it to use the HTs as independent cpus 
where they are supported.

I'm not entirely sure the suggestion will work - if we hit a place where HT 
isn't supported, we may balk at being asked to

Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3

2015-04-08 Thread Lane, William
Ralph,

I just wanted to add that roughly a year ago I was fighting w/these
same issues, but was re-tasked to more pressing issues and had to
abandon looking into these OpenMPI 1.8.2 issues on our CentOS 6.3
cluster.

In any case, in digging around I found you had the following
recommendation back then:

> Argh - yeah, I got confused as things context switched a few too many times. 
> The 1.8.2 release should certainly understand that arrangement, and 
> --hetero-nodes. The only way it wouldn't see the latter is if you configure 
> it --without-hwloc, or hwloc refused to build.
>
> Since there was a question about the numactl-devel requirement, I suspect 
> that is the root cause of all evil in this case and the lack of 
> --hetero-nodes would confirm that diagnosis :-)

So the numactl-devel library is required for OpenMPI to function on NUMA
nodes? Or maybe just NUMA nodes that also have hyperthreading capabilities?

Bill L.


From: users [users-boun...@open-mpi.org] on behalf of Lane, William 
[william.l...@cshs.org]
Sent: Wednesday, April 08, 2015 9:29 AM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3

Ralph,

Thanks for YOUR help,  I never
would've managed to get the LAPACK
benchmark running on more than one
node in our cluster without your help.

Ralph, is hyperthreading more of a curse
than an advantage for HPC applications?

I'm going to go through all the OpenMPI
articles on hyperthreading and NUMA to
see if that will shed any light on these
issues.

-Bill L.



From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain 
[r...@open-mpi.org]
Sent: Tuesday, April 07, 2015 7:32 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3

I’m not sure our man pages are good enough to answer your question, but here is 
the URL

http://www.open-mpi.org/doc/v1.8/

I’m a tad tied up right now, but I’ll try to address this prior to 1.8.5 
release. Thanks for all that debug effort! Helps a bunch.

On Apr 7, 2015, at 1:17 PM, Lane, William  wrote:

Ralph,

I've finally had some luck using the following:
$MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-single 
--mca btl_tcp_if_include eth0 --hetero-nodes --use-hwthread-cpus --prefix 
$MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN

Where $NSLOTS was 56 and my hostfile hostfile-single is:

csclprd3-0-0 slots=12 max-slots=24
csclprd3-0-1 slots=6 max-slots=12
csclprd3-0-2 slots=6 max-slots=12
csclprd3-0-3 slots=6 max-slots=12
csclprd3-0-4 slots=6 max-slots=12
csclprd3-0-5 slots=6 max-slots=12
csclprd3-0-6 slots=6 max-slots=12
csclprd3-6-1 slots=4 max-slots=4
csclprd3-6-5 slots=4 max-slots=4

The max-slots differs from slots on some nodes
because I include the hyperthreaded cores in
the max-slots, the last two nodes have CPU's that
don't support hyperthreading at all.

Does --use-hwthread-cpus prevent slots from
being assigned to hyperthreading cores?

For some reason the man page for OpenMPI 1.8.2
isn't installed on our CentOS 6.3 systems. Is there a
URL where I can find a copy of the man pages for OpenMPI 1.8.2?

Thanks for your help,

-Bill Lane


From: users [users-boun...@open-mpi.org] on 
behalf of Ralph Castain [r...@open-mpi.org]
Sent: Monday, April 06, 2015 1:39 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3

Hmmm…well, that shouldn’t be the issue. To check, try running it with “bind-to 
none”. If you can get a backtrace telling us where it is crashing, that would 
also help.


On Apr 6, 2015, at 12:24 PM, Lane, William  wrote:

Ralph,

For the following two different commandline invocations of the LAPACK benchmark

$MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-no_slots 
--mca btl_tcp_if_include eth0 --hetero-nodes --use-hwthread-cpus --bind-to 
hwthread --prefix $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN

$MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-no_slots 
--mca btl_tcp_if_include eth0 --hetero-nodes --bind-to-core --prefix $MPI_DIR 
$BENCH_DIR/$APP_DIR/$APP_BIN

I'm receiving the same kinds of OpenMPI error messages (but for different nodes 
in the ring):

[csclprd3-0-16:25940] *** Process received signal ***
[csclprd3-0-16:25940] Signal: Bus error (7)
[csclprd3-0-16:25940] Signal code: Non-existant physical address (2)
[csclprd3-0-16:25940] Failing at address: 0x7f8b1b5a2600


--
mpirun noticed that process rank 82 with PID 25936 on node 
csclprd3-0-16 exited on signal 7 (Bus error).

--
16 total processes killed (some possibly by mpirun during cleanup)

It seems to occur on sys

Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3

2015-04-08 Thread Ralph Castain

> On Apr 8, 2015, at 10:20 AM, Lane, William  wrote:
> 
> Ralph,
> 
> I just wanted to add that roughly a year ago I was fighting w/these
> same issues, but was re-tasked to more pressing issues and had to
> abandon looking into these OpenMPI 1.8.2 issues on our CentOS 6.3
> cluster.
> 
> In any case, in digging around I found you had the following
> recommendation back then:
> 
> > Argh - yeah, I got confused as things context switched a few too many 
> > times. The 1.8.2 release should certainly understand that arrangement, and 
> > --hetero-nodes. The only way it wouldn't see the latter is if you configure 
> > it --without-hwloc, or hwloc refused to build. 

I believe we fixed those issues

> > 
> > Since there was a question about the numactl-devel requirement, I suspect 
> > that is the root cause of all evil in this case and the lack of 
> > --hetero-nodes would confirm that diagnosis :-) 
> 
> So the numactl-devel library is required for OpenMPI to function on NUMA
> nodes? Or maybe just NUMA nodes that also have hyperthreading capabilities?

Binding in general requires numactl-devel, whether to HT or non-HT nodes

> 
> Bill L.
> 
> From: users [users-boun...@open-mpi.org ] 
> on behalf of Lane, William [william.l...@cshs.org 
> ]
> Sent: Wednesday, April 08, 2015 9:29 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
> 
> Ralph,
> 
> Thanks for YOUR help,  I never
> would've managed to get the LAPACK
> benchmark running on more than one
> node in our cluster without your help.
> 
> Ralph, is hyperthreading more of a curse
> than an advantage for HPC applications?
> 
> I'm going to go through all the OpenMPI 
> articles on hyperthreading and NUMA to
> see if that will shed any light on these
> issues.
> 
> -Bill L.
> 
> 
> From: users [users-boun...@open-mpi.org ] 
> on behalf of Ralph Castain [r...@open-mpi.org ]
> Sent: Tuesday, April 07, 2015 7:32 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
> 
> I’m not sure our man pages are good enough to answer your question, but here 
> is the URL
> 
> http://www.open-mpi.org/doc/v1.8/ 
> 
> I’m a tad tied up right now, but I’ll try to address this prior to 1.8.5 
> release. Thanks for all that debug effort! Helps a bunch.
> 
>> On Apr 7, 2015, at 1:17 PM, Lane, William  wrote:
>> 
>> Ralph,
>> 
>> I've finally had some luck using the following:
>> $MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-single 
>> --mca btl_tcp_if_include eth0 --hetero-nodes --use-hwthread-cpus --prefix 
>> $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN
>> 
>> Where $NSLOTS was 56 and my hostfile hostfile-single is:
>> 
>> csclprd3-0-0 slots=12 max-slots=24
>> csclprd3-0-1 slots=6 max-slots=12
>> csclprd3-0-2 slots=6 max-slots=12
>> csclprd3-0-3 slots=6 max-slots=12
>> csclprd3-0-4 slots=6 max-slots=12
>> csclprd3-0-5 slots=6 max-slots=12
>> csclprd3-0-6 slots=6 max-slots=12
>> csclprd3-6-1 slots=4 max-slots=4
>> csclprd3-6-5 slots=4 max-slots=4
>> 
>> The max-slots differs from slots on some nodes
>> because I include the hyperthreaded cores in
>> the max-slots, the last two nodes have CPU's that
>> don't support hyperthreading at all.
>> 
>> Does --use-hwthread-cpus prevent slots from
>> being assigned to hyperthreading cores?
>> 
>> For some reason the manpage for OpenMPI 1.8.2
>> isn't installed on our CentOS 6.3 systems is there a
>> URL I can I find a copy of the manpages for OpenMPI 1.8.2?
>> 
>> Thanks for your help,
>> 
>> -Bill Lane
>> 
>> From: users [users-boun...@open-mpi.org ] 
>> on behalf of Ralph Castain [r...@open-mpi.org ]
>> Sent: Monday, April 06, 2015 1:39 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
>> 
>> Hmmm…well, that shouldn’t be the issue. To check, try running it with 
>> “bind-to none”. If you can get a backtrace telling us where it is crashing, 
>> that would also help.
>> 
>> 
>>> On Apr 6, 2015, at 12:24 PM, Lane, William  wrote:
>>> 
>>> Ralph,
>>> 
>>> For the following two different commandline invocations of the LAPACK 
>>> benchmark
>>> 
>>> $MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile 
>>> hostfile-no_slots --mca btl_tcp_if_include eth0 --hetero-nodes 
>>> --use-hwthread-cpus --bind-to hwthread --prefix $MPI_DIR 
>>> $BENCH_DIR/$APP_DIR/$APP_BIN
>>> 
>>> $MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile 
>>> hostfile-no_slots --mca btl_tcp_if_include eth0 --hetero-nodes 
>>> --bind-to-core --prefix $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN
>>> 
>>> I'm receiving the same kinds of OpenMPI error messages (but for different 
>>> nodes in the ring):
>>> 
>>> [csclprd3-0-16:25940] *** Process 

Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3

2015-04-08 Thread Ralph Castain

> On Apr 8, 2015, at 9:29 AM, Lane, William  wrote:
> 
> Ralph,
> 
> Thanks for YOUR help,  I never
> would've managed to get the LAPACK
> benchmark running on more than one
> node in our cluster without your help.
> 
> Ralph, is hyperthreading more of a curse
> than an advantage for HPC applications?

Wow, you’ll get a lot of argument over that issue! From what I can see, it is 
very application dependent. Some apps appear to benefit, while others can even 
suffer from it.

I think we should support a mix of nodes in this usage, so I’ll try to come up 
with a way to do so.

> 
> I'm going to go through all the OpenMPI 
> articles on hyperthreading and NUMA to
> see if that will shed any light on these
> issues.
> 
> -Bill L.
> 
> 
> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain 
> [r...@open-mpi.org]
> Sent: Tuesday, April 07, 2015 7:32 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
> 
> I’m not sure our man pages are good enough to answer your question, but here 
> is the URL
> 
> http://www.open-mpi.org/doc/v1.8/ 
> 
> I’m a tad tied up right now, but I’ll try to address this prior to 1.8.5 
> release. Thanks for all that debug effort! Helps a bunch.
> 
>> On Apr 7, 2015, at 1:17 PM, Lane, William  wrote:
>> 
>> Ralph,
>> 
>> I've finally had some luck using the following:
>> $MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-single 
>> --mca btl_tcp_if_include eth0 --hetero-nodes --use-hwthread-cpus --prefix 
>> $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN
>> 
>> Where $NSLOTS was 56 and my hostfile hostfile-single is:
>> 
>> csclprd3-0-0 slots=12 max-slots=24
>> csclprd3-0-1 slots=6 max-slots=12
>> csclprd3-0-2 slots=6 max-slots=12
>> csclprd3-0-3 slots=6 max-slots=12
>> csclprd3-0-4 slots=6 max-slots=12
>> csclprd3-0-5 slots=6 max-slots=12
>> csclprd3-0-6 slots=6 max-slots=12
>> csclprd3-6-1 slots=4 max-slots=4
>> csclprd3-6-5 slots=4 max-slots=4
>> 
>> The max-slots differs from slots on some nodes
>> because I include the hyperthreaded cores in
>> the max-slots, the last two nodes have CPU's that
>> don't support hyperthreading at all.
>> 
>> Does --use-hwthread-cpus prevent slots from
>> being assigned to hyperthreading cores?
>> 
>> For some reason the manpage for OpenMPI 1.8.2
>> isn't installed on our CentOS 6.3 systems is there a
>> URL I can I find a copy of the manpages for OpenMPI 1.8.2?
>> 
>> Thanks for your help,
>> 
>> -Bill Lane
>> 
>> From: users [users-boun...@open-mpi.org ] 
>> on behalf of Ralph Castain [r...@open-mpi.org ]
>> Sent: Monday, April 06, 2015 1:39 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
>> 
>> Hmmm…well, that shouldn’t be the issue. To check, try running it with 
>> “bind-to none”. If you can get a backtrace telling us where it is crashing, 
>> that would also help.
>> 
>> 
>>> On Apr 6, 2015, at 12:24 PM, Lane, William  wrote:
>>> 
>>> Ralph,
>>> 
>>> For the following two different commandline invocations of the LAPACK 
>>> benchmark
>>> 
>>> $MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile 
>>> hostfile-no_slots --mca btl_tcp_if_include eth0 --hetero-nodes 
>>> --use-hwthread-cpus --bind-to hwthread --prefix $MPI_DIR 
>>> $BENCH_DIR/$APP_DIR/$APP_BIN
>>> 
>>> $MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile 
>>> hostfile-no_slots --mca btl_tcp_if_include eth0 --hetero-nodes 
>>> --bind-to-core --prefix $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN
>>> 
>>> I'm receiving the same kinds of OpenMPI error messages (but for different 
>>> nodes in the ring):
>>> 
>>> [csclprd3-0-16:25940] *** Process received signal ***
>>> [csclprd3-0-16:25940] Signal: Bus error (7)
>>> [csclprd3-0-16:25940] Signal code: Non-existant physical address (2)
>>> [csclprd3-0-16:25940] Failing at address: 0x7f8b1b5a2600
>>> 
>>> 
>>> --
>>> mpirun noticed that process rank 82 with PID 25936 on node 
>>> csclprd3-0-16 exited on signal 7 (Bus error).
>>> 
>>> --
>>> 16 total processes killed (some possibly by mpirun during cleanup)
>>> 
>>> It seems to occur on systems that have more than one, physical CPU 
>>> installed. Could
>>> this be due to a lack of the correct NUMA libraries being installed?
>>> 
>>> -Bill L.
>>> 
>>> From: users [users-boun...@open-mpi.org 
>>> ] on behalf of Ralph Castain 
>>> [r...@open-mpi.org ]
>>> Sent: Sunday, April 05, 2015 6:09 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
>>> 
>>> 
 On Apr 5, 2015, at 5:58 PM, Lane, William >>> 

Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3

2015-04-08 Thread Lane, William
Ralph,

I added one of the newer LGA2011 nodes to my hostfile and
re-ran the benchmark successfully and saw some strange results WRT the
binding directives. Why are hyperthreading cores being used
on the LGA2011 system, but not on any of the other systems (which
are mostly hyperthreaded Westmeres)? Isn't the --use-hwthread-cpus
switch supposed to prevent OpenMPI from using hyperthreaded
cores?

OpenMPI LAPACK invocation:

$MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-single 
--mca btl_tcp_if_include eth0 --hetero-nodes --use-hwthread-cpus --prefix 
$MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN

Where NSLOTS=72

hostfile:
csclprd3-6-1 slots=4 max-slots=4
csclprd3-6-5 slots=4 max-slots=4
csclprd3-0-0 slots=12 max-slots=24
csclprd3-0-1 slots=6 max-slots=12
csclprd3-0-2 slots=6 max-slots=12
csclprd3-0-3 slots=6 max-slots=12
csclprd3-0-4 slots=6 max-slots=12
csclprd3-0-5 slots=6 max-slots=12
csclprd3-0-6 slots=6 max-slots=12
#total number of successfully tested non-hyperthreaded compute slots at this 
point is 56
csclprd3-0-7 slots=16 max-slots=32

LGA1366 Westmere w/two Intel Xeon X5675 6-core/12-hyperthread CPU's

[csclprd3-0-0:11848] MCW rank 11 bound to socket 1[core 7[hwt 0]]: 
[./././././.][./B/./././.]
[csclprd3-0-0:11848] MCW rank 12 bound to socket 0[core 2[hwt 0]]: 
[././B/././.][./././././.]
[csclprd3-0-0:11848] MCW rank 13 bound to socket 1[core 8[hwt 0]]: 
[./././././.][././B/././.]
[csclprd3-0-0:11848] MCW rank 14 bound to socket 0[core 3[hwt 0]]: 
[./././B/./.][./././././.]
[csclprd3-0-0:11848] MCW rank 15 bound to socket 1[core 9[hwt 0]]: 
[./././././.][./././B/./.]
[csclprd3-0-0:11848] MCW rank 16 bound to socket 0[core 4[hwt 0]]: 
[././././B/.][./././././.]
[csclprd3-0-0:11848] MCW rank 17 bound to socket 1[core 10[hwt 0]]: 
[./././././.][././././B/.]
[csclprd3-0-0:11848] MCW rank 18 bound to socket 0[core 5[hwt 0]]: 
[./././././B][./././././.]
[csclprd3-0-0:11848] MCW rank 19 bound to socket 1[core 11[hwt 0]]: 
[./././././.][./././././B]
[csclprd3-0-0:11848] MCW rank 8 bound to socket 0[core 0[hwt 0]]: 
[B/././././.][./././././.]
[csclprd3-0-0:11848] MCW rank 9 bound to socket 1[core 6[hwt 0]]: 
[./././././.][B/././././.]
[csclprd3-0-0:11848] MCW rank 10 bound to socket 0[core 1[hwt 0]]: 
[./B/./././.][./././././.]

but for the LGA2011 system w/two 8-core/16-hyperthread CPU's

[csclprd3-0-7:30876] MCW rank 60 bound to socket 0[core 2[hwt 0-1]]: 
[../../BB/../../../../..][../../../../../../../..]
[csclprd3-0-7:30876] MCW rank 61 bound to socket 1[core 10[hwt 0-1]]: 
[../../../../../../../..][../../BB/../../../../..]
[csclprd3-0-7:30876] MCW rank 62 bound to socket 0[core 3[hwt 0-1]]: 
[../../../BB/../../../..][../../../../../../../..]
[csclprd3-0-7:30876] MCW rank 63 bound to socket 1[core 11[hwt 0-1]]: 
[../../../../../../../..][../../../BB/../../../..]
[csclprd3-0-7:30876] MCW rank 64 bound to socket 0[core 4[hwt 0-1]]: 
[../../../../BB/../../..][../../../../../../../..]
[csclprd3-0-7:30876] MCW rank 65 bound to socket 1[core 12[hwt 0-1]]: 
[../../../../../../../..][../../../../BB/../../..]
[csclprd3-0-7:30876] MCW rank 66 bound to socket 0[core 5[hwt 0-1]]: 
[../../../../../BB/../..][../../../../../../../..]
[csclprd3-0-7:30876] MCW rank 67 bound to socket 1[core 13[hwt 0-1]]: 
[../../../../../../../..][../../../../../BB/../..]
[csclprd3-0-7:30876] MCW rank 68 bound to socket 0[core 6[hwt 0-1]]: 
[../../../../../../BB/..][../../../../../../../..]
[csclprd3-0-7:30876] MCW rank 69 bound to socket 1[core 14[hwt 0-1]]: 
[../../../../../../../..][../../../../../../BB/..]
[csclprd3-0-7:30876] MCW rank 70 bound to socket 0[core 7[hwt 0-1]]: 
[../../../../../../../BB][../../../../../../../..]
[csclprd3-0-7:30876] MCW rank 71 bound to socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][../../../../../../../BB]
[csclprd3-0-7:30876] MCW rank 56 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../../../../../..][../../../../../../../..]
[csclprd3-0-7:30876] MCW rank 57 bound to socket 1[core 8[hwt 0-1]]: 
[../../../../../../../..][BB/../../../../../../..]
[csclprd3-0-7:30876] MCW rank 58 bound to socket 0[core 1[hwt 0-1]]: 
[../BB/../../../../../..][../../../../../../../..]
[csclprd3-0-7:30876] MCW rank 59 bound to socket 1[core 9[hwt 0-1]]: 
[../../../../../../../..][../BB/../../../../../..]





From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain 
[r...@open-mpi.org]
Sent: Wednesday, April 08, 2015 10:26 AM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3


On Apr 8, 2015, at 9:29 AM, Lane, William  wrote:

Ralph,

Thanks for YOUR help,  I never
would've managed to get the LAPACK
benchmark running on more than one
node in our cluster without your help.

Ralph, is hyperthreading more of a curse
than an advantage for HPC applications?

Wow, you’ll get a lot of argument over that issue! From what I can see, it is 
very application dependent. Some apps appear to benefit, while others can 

Re: [OMPI users] parsability of ompi_info --parsable output

2015-04-08 Thread Lev Givon
Received from Ralph Castain on Wed, Apr 08, 2015 at 12:23:28PM EDT:

> Sounds reasonable - I don't have time to work thru it right now, but we can
> look at it once Jeff returns as he wrote all that stuff and might see where to
> make the changes more readily than me.

Made a note of the suggestion here:

https://github.com/open-mpi/ompi/issues/515

Thanks,
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/



Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3

2015-04-08 Thread Ralph Castain
Just for clarity: does the BIOS on the LGA2011 system have HT enabled?

> On Apr 8, 2015, at 10:55 AM, Lane, William  wrote:
> 
> Ralph,
> 
> I added one of the newer LGA2011 nodes to my hostfile and
> re-ran the benchmark successfully and saw some strange results WRT the
> binding directives. Why are hyperthreading cores being used
> on the LGA2011 system but not any of other systems which
> are mostly hyperthreaded Westmeres)? Isn't the --use-hwthread-cpus
> switch supposed to prevent OpenMPI from using hyperthreaded
> cores?
> 
> OpenMPI LAPACK invocation:
> 
> $MPI_DIR/bin/mpirun -np $NSLOTS --report-bindings --hostfile hostfile-single 
> --mca btl_tcp_if_include eth0 --hetero-nodes --use-hwthread-cpus --prefix 
> $MPI_DIR $BENCH_DIR/$APP_DIR/$APP_BIN
> 
> Where NSLOTS=72
> 
> hostfile:
> csclprd3-6-1 slots=4 max-slots=4
> csclprd3-6-5 slots=4 max-slots=4
> csclprd3-0-0 slots=12 max-slots=24
> csclprd3-0-1 slots=6 max-slots=12
> csclprd3-0-2 slots=6 max-slots=12
> csclprd3-0-3 slots=6 max-slots=12
> csclprd3-0-4 slots=6 max-slots=12
> csclprd3-0-5 slots=6 max-slots=12
> csclprd3-0-6 slots=6 max-slots=12
> #total number of successfully tested non-hyperthreaded computes slots at this 
> point is 56
> csclprd3-0-7 slots=16 max-slots=32
> 
> LGA1366 Westmere w/two Intel Xeon X5675 6-core/12-hyperthread CPU's
> 
> [csclprd3-0-0:11848] MCW rank 11 bound to socket 1[core 7[hwt 0]]: 
> [./././././.][./B/./././.]
> [csclprd3-0-0:11848] MCW rank 12 bound to socket 0[core 2[hwt 0]]: 
> [././B/././.][./././././.]
> [csclprd3-0-0:11848] MCW rank 13 bound to socket 1[core 8[hwt 0]]: 
> [./././././.][././B/././.]
> [csclprd3-0-0:11848] MCW rank 14 bound to socket 0[core 3[hwt 0]]: 
> [./././B/./.][./././././.]
> [csclprd3-0-0:11848] MCW rank 15 bound to socket 1[core 9[hwt 0]]: 
> [./././././.][./././B/./.]
> [csclprd3-0-0:11848] MCW rank 16 bound to socket 0[core 4[hwt 0]]: 
> [././././B/.][./././././.]
> [csclprd3-0-0:11848] MCW rank 17 bound to socket 1[core 10[hwt 0]]: 
> [./././././.][././././B/.]
> [csclprd3-0-0:11848] MCW rank 18 bound to socket 0[core 5[hwt 0]]: 
> [./././././B][./././././.]
> [csclprd3-0-0:11848] MCW rank 19 bound to socket 1[core 11[hwt 0]]: 
> [./././././.][./././././B]
> [csclprd3-0-0:11848] MCW rank 8 bound to socket 0[core 0[hwt 0]]: 
> [B/././././.][./././././.]
> [csclprd3-0-0:11848] MCW rank 9 bound to socket 1[core 6[hwt 0]]: 
> [./././././.][B/././././.]
> [csclprd3-0-0:11848] MCW rank 10 bound to socket 0[core 1[hwt 0]]: 
> [./B/./././.][./././././.]
> 
> but for the LGA2011 system w/two 8-core/16-hyperthread CPU's 
> 
> [csclprd3-0-7:30876] MCW rank 60 bound to socket 0[core 2[hwt 0-1]]: 
> [../../BB/../../../../..][../../../../../../../..]
> [csclprd3-0-7:30876] MCW rank 61 bound to socket 1[core 10[hwt 0-1]]: 
> [../../../../../../../..][../../BB/../../../../..]
> [csclprd3-0-7:30876] MCW rank 62 bound to socket 0[core 3[hwt 0-1]]: 
> [../../../BB/../../../..][../../../../../../../..]
> [csclprd3-0-7:30876] MCW rank 63 bound to socket 1[core 11[hwt 0-1]]: 
> [../../../../../../../..][../../../BB/../../../..]
> [csclprd3-0-7:30876] MCW rank 64 bound to socket 0[core 4[hwt 0-1]]: 
> [../../../../BB/../../..][../../../../../../../..]
> [csclprd3-0-7:30876] MCW rank 65 bound to socket 1[core 12[hwt 0-1]]: 
> [../../../../../../../..][../../../../BB/../../..]
> [csclprd3-0-7:30876] MCW rank 66 bound to socket 0[core 5[hwt 0-1]]: 
> [../../../../../BB/../..][../../../../../../../..]
> [csclprd3-0-7:30876] MCW rank 67 bound to socket 1[core 13[hwt 0-1]]: 
> [../../../../../../../..][../../../../../BB/../..]
> [csclprd3-0-7:30876] MCW rank 68 bound to socket 0[core 6[hwt 0-1]]: 
> [../../../../../../BB/..][../../../../../../../..]
> [csclprd3-0-7:30876] MCW rank 69 bound to socket 1[core 14[hwt 0-1]]: 
> [../../../../../../../..][../../../../../../BB/..]
> [csclprd3-0-7:30876] MCW rank 70 bound to socket 0[core 7[hwt 0-1]]: 
> [../../../../../../../BB][../../../../../../../..]
> [csclprd3-0-7:30876] MCW rank 71 bound to socket 1[core 15[hwt 0-1]]: 
> [../../../../../../../..][../../../../../../../BB]
> [csclprd3-0-7:30876] MCW rank 56 bound to socket 0[core 0[hwt 0-1]]: 
> [BB/../../../../../../..][../../../../../../../..]
> [csclprd3-0-7:30876] MCW rank 57 bound to socket 1[core 8[hwt 0-1]]: 
> [../../../../../../../..][BB/../../../../../../..]
> [csclprd3-0-7:30876] MCW rank 58 bound to socket 0[core 1[hwt 0-1]]: 
> [../BB/../../../../../..][../../../../../../../..]
> [csclprd3-0-7:30876] MCW rank 59 bound to socket 1[core 9[hwt 0-1]]: 
> [../../../../../../../..][../BB/../../../../../..]
> 
> 
> 
> 
> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain 
> [r...@open-mpi.org]
> Sent: Wednesday, April 08, 2015 10:26 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3
> 
> 
>> On Apr 8, 2015, at 9:29 AM, Lane, William  wrote:
>> 
>> Ralph,
>> 
>> Thanks for YOUR help,  I never
>> would've managed to g