Re: [OMPI users] Infiniband errors

2012-12-19 Thread Syed Ahsan Ali
Dear John

I found this output of ibstatus on some nodes (most probably the ones
causing the problem):
[root@compute-01-08 ~]# ibstatus
Fatal error:  device '*': sys files not found
(/sys/class/infiniband/*/ports)

Does this show any hardware or software issue?

Thanks




On Wed, Nov 28, 2012 at 3:17 PM, John Hearns  wrote:

> Those diagnostics are from Openfabrics.
> What type of infiniband card do you have?
> What drivers are you using?
>


Re: [OMPI users] openmpi-1.9a1r27674 on Cygwin-1.7.17

2012-12-19 Thread Siegmar Gross
Hi

> On 12/18/2012 6:55 PM, Jeff Squyres wrote:
> > ...but only of v1.6.x.
> 
> okay, adding development version on Christmas wishlist
> ;-)

Can you build the package with thread and Java support?

  --enable-mpi-java \
  --enable-opal-multi-threads \
  --enable-mpi-thread-multiple \
  --with-threads=posix \

I could build openmpi-1.6.4 with thread support without a problem on
Cygwin 1.7.17, but so far I have failed to build openmpi-1.9.


> > On Dec 18, 2012, at 10:32 AM, Ralph Castain wrote:
> >
> >> Also, be aware that the Cygwin folks have already released a
> >> fully functional port of OMPI to that environment as a package.
> >> So if you want OMPI on Cygwin, you can just download and
> >> install the Cygwin package - no need to build it yourself.


Kind regards

Siegmar



Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

2012-12-19 Thread Number Cruncher
Having run some more benchmarks, the new default is *really* bad for our 
application (2-10x slower), so I've been looking at the source to try 
and figure out why.


It seems that the biggest difference will occur when the all_to_all is 
actually sparse (e.g. our application); if most N-M process exchanges 
are zero in size the 1.6 ompi_coll_tuned_alltoallv_intra_basic_linear 
algorithm will actually only post irecv/isend for non-zero exchanges; 
any zero-size exchanges are skipped. It then waits once for all requests 
to complete. In contrast, the new 
ompi_coll_tuned_alltoallv_intra_pairwise will post the zero-size 
exchanges for *every* N-M pair, and wait for each pairwise exchange. 
This is O(comm_size) waits, many of which are zero. I'm not clear what 
optimizations there are for zero-size isend/irecv, but surely there's a 
great deal more latency if each pairwise exchange has to be confirmed 
complete before executing the next?
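
To make the comparison concrete, here is a rough sketch of the two
strategies as I read them from the source (simplified, with illustrative
names; this is not the actual Open MPI code):

/* Sketch only, not Open MPI code: the 1.6.0-style "basic linear" strategy
 * posts only the non-zero exchanges and waits once for all of them. */
#include <mpi.h>
#include <stdlib.h>

void alltoallv_linear_sketch(const char *sbuf, const int *scounts,
                             const int *sdispls, char *rbuf,
                             const int *rcounts, const int *rdispls,
                             MPI_Comm comm)
{
    int rank, size, nreq = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    MPI_Request *reqs = malloc(2 * (size_t)size * sizeof(MPI_Request));

    for (int peer = 0; peer < size; ++peer) {
        if (peer == rank)
            continue;                      /* local copy omitted for brevity */
        if (rcounts[peer] > 0)             /* zero-size exchanges are skipped */
            MPI_Irecv(rbuf + rdispls[peer], rcounts[peer], MPI_BYTE,
                      peer, 0, comm, &reqs[nreq++]);
        if (scounts[peer] > 0)
            MPI_Isend((void *)(sbuf + sdispls[peer]), scounts[peer], MPI_BYTE,
                      peer, 0, comm, &reqs[nreq++]);
    }
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);  /* one wait for everything */
    free(reqs);
}

/* The 1.6.1 pairwise strategy, as I read it, instead does one step per peer:
 *
 *     for (int step = 1; step < size; ++step) {
 *         int sendto   = (rank + step) % size;
 *         int recvfrom = (rank - step + size) % size;
 *         MPI_Sendrecv(sbuf + sdispls[sendto], scounts[sendto], MPI_BYTE,
 *                      sendto, 0,
 *                      rbuf + rdispls[recvfrom], rcounts[recvfrom], MPI_BYTE,
 *                      recvfrom, 0, comm, MPI_STATUS_IGNORE);
 *     }   // one wait per step, even when both counts are zero
 */

If that reading is right, the sparse case now pays a full round of per-step
completions where it previously paid essentially nothing.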


Relatedly, how would I direct OpenMPI to use the older algorithm 
programmatically? I don't want the user to have to use "--mca" in their 
"mpiexec". Is there a C API?


Thanks,
Simon


On 16/11/12 10:15, Iliev, Hristo wrote:

Hi, Simon,

The pairwise algorithm passes messages in a synchronised ring-like fashion
with increasing stride, so it works best when independent communication
paths could be established between several ports of the network
switch/router. Some 1 Gbps Ethernet equipment is not capable of doing so,
some is - it depends (usually on the price). This said, not all algorithms
perform the same given a specific type of network interconnect. For example,
on our fat-tree InfiniBand network the pairwise algorithm performs better.

You can switch back to the basic linear algorithm by providing the following
MCA parameters:

mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca
coll_tuned_alltoallv_algorithm 1 ...

Algorithm 1 is the basic linear, which used to be the default. Algorithm 2
is the pairwise one.
  
You can also set these values as exported environment variables:


export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
mpiexec ...

You can also put this in $HOME/.openmpi/mcaparams.conf or (to make it have
global effect) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:

coll_tuned_use_dynamic_rules=1
coll_tuned_alltoallv_algorithm=1

A gratuitous hint: dual-Opteron systems are NUMAs so it makes sense to
activate process binding with --bind-to-core if you haven't already done so.
It prevents MPI processes from being migrated to other NUMA nodes while
running.

Kind regards,
Hristo
--
Hristo Iliev, Ph.D. -- High Performance Computing
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23,  D 52074  Aachen (Germany)



-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
On Behalf Of Number Cruncher
Sent: Thursday, November 15, 2012 5:37 PM
To: Open MPI Users
Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

I've noticed a very significant (100%) slow down for MPI_Alltoallv calls
as of version 1.6.1.
* This is most noticeable for high-frequency exchanges over 1Gb ethernet
where process-to-process message sizes are fairly small (e.g. 100kbyte)
and much of the exchange matrix is sparse.
* 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default algorithm
to a pairwise exchange", but I'm not clear what this means or how to
switch back to the old "non-default algorithm".

I attach a test program which illustrates the sort of usage in our MPI
application. I have run this as 32 processes on four nodes, over 1Gb
ethernet, each node with 2x Opteron 4180 (dual hex-core); rank 0,4,8,..
on node 1, rank 1,5,9, ... on node 2, etc.

It constructs an array of integers and an nProcess x nProcess exchange
typical of part of our application. This is then exchanged several
thousand times. Output from "mpicc -O3" runs is shown below.

My guess is that 1.6.1 is hitting additional latency not present in
1.6.0. I also attach a plot showing network throughput on our actual
mesh generation application. Nodes cfsc01-04 are running 1.6.0 and
finish within 35 minutes. Nodes cfsc05-08 are running 1.6.2 (started 10
minutes later) and take over an hour to run. There seems to be a much
greater network demand in the 1.6.1 version, despite the user-code and
input data being identical.

Thanks for any help you can give,
Simon





Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

2012-12-19 Thread Paul Kapinos
Did you *really* want to dig into the code just to switch a default
communication algorithm?


Note there are several ways to set the parameters; --mca on command line is just 
one of them (suitable for quick online tests).


http://www.open-mpi.org/faq/?category=tuning#setting-mca-params

We 'tune' our Open MPI by setting environment variables.

Best
Paul Kapinos



On 12/19/12 11:44, Number Cruncher wrote:

Having run some more benchmarks, the new default is *really* bad for our
application (2-10x slower), so I've been looking at the source to try and figure
out why.

It seems that the biggest difference will occur when the all_to_all is actually
sparse (e.g. our application); if most N-M process exchanges are zero in size
the 1.6 ompi_coll_tuned_alltoallv_intra_basic_linear algorithm will actually
only post irecv/isend for non-zero exchanges; any zero-size exchanges are
skipped. It then waits once for all requests to complete. In contrast, the new
ompi_coll_tuned_alltoallv_intra_pairwise will post the zero-size exchanges for
*every* N-M pair, and wait for each pairwise exchange. This is O(comm_size)
waits, many of which are zero. I'm not clear what optimizations there are for
zero-size isend/irecv, but surely there's a great deal more latency if each
pairwise exchange has to be confirmed complete before executing the next?

Relatedly, how would I direct OpenMPI to use the older algorithm
programmatically? I don't want the user to have to use "--mca" in their
"mpiexec". Is there a C API?

Thanks,
Simon


On 16/11/12 10:15, Iliev, Hristo wrote:

Hi, Simon,

The pairwise algorithm passes messages in a synchronised ring-like fashion
with increasing stride, so it works best when independent communication
paths could be established between several ports of the network
switch/router. Some 1 Gbps Ethernet equipment is not capable of doing so,
some is - it depends (usually on the price). This said, not all algorithms
perform the same given a specific type of network interconnect. For example,
on our fat-tree InfiniBand network the pairwise algorithm performs better.

You can switch back to the basic linear algorithm by providing the following
MCA parameters:

mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca
coll_tuned_alltoallv_algorithm 1 ...

Algorithm 1 is the basic linear, which used to be the default. Algorithm 2
is the pairwise one.
You can also set these values as exported environment variables:

export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
mpiexec ...

You can also put this in $HOME/.openmpi/mcaparams.conf or (to make it have
global effect) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:

coll_tuned_use_dynamic_rules=1
coll_tuned_alltoallv_algorithm=1

A gratuitous hint: dual-Opteron systems are NUMAs so it makes sense to
activate process binding with --bind-to-core if you haven't already done so.
It prevents MPI processes from being migrated to other NUMA nodes while
running.

Kind regards,
Hristo
--
Hristo Iliev, Ph.D. -- High Performance Computing
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)



-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
On Behalf Of Number Cruncher
Sent: Thursday, November 15, 2012 5:37 PM
To: Open MPI Users
Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

I've noticed a very significant (100%) slow down for MPI_Alltoallv calls
as of version 1.6.1.
* This is most noticeable for high-frequency exchanges over 1Gb ethernet
where process-to-process message sizes are fairly small (e.g. 100kbyte)
and much of the exchange matrix is sparse.
* 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default algorithm
to a pairwise exchange", but I'm not clear what this means or how to
switch back to the old "non-default algorithm".

I attach a test program which illustrates the sort of usage in our MPI
application. I have run this as 32 processes on four nodes, over 1Gb
ethernet, each node with 2x Opteron 4180 (dual hex-core); rank 0,4,8,..
on node 1, rank 1,5,9, ... on node 2, etc.

It constructs an array of integers and an nProcess x nProcess exchange
typical of part of our application. This is then exchanged several
thousand times. Output from "mpicc -O3" runs is shown below.

My guess is that 1.6.1 is hitting additional latency not present in
1.6.0. I also attach a plot showing network throughput on our actual
mesh generation application. Nodes cfsc01-04 are running 1.6.0 and
finish within 35 minutes. Nodes cfsc05-08 are running 1.6.2 (started 10
minutes later) and take over an hour to run. There seems to be a much
greater network demand in the 1.6.1 version, despite the user-code and
input data being identical.

Thanks for any help you can give,
Simon




Re: [OMPI users] openmpi-1.9a1r27674 on Cygwin-1.7.17

2012-12-19 Thread marco atzeri

On 12/19/2012 11:04 AM, Siegmar Gross wrote:

Hi


On 12/18/2012 6:55 PM, Jeff Squyres wrote:

...but only of v1.6.x.


okay, adding development version on Christmas wishlist
;-)


Can you build the package with thread and Java support?

   --enable-mpi-java \
   --enable-opal-multi-threads \
   --enable-mpi-thread-multiple \
   --with-threads=posix \

I could build openmpi-1.6.4 with thread support without a problem on
Cygwin 1.7.17, but so far I have failed to build openmpi-1.9.



I am working on openmpi-1.7rc5.
It needs some cleaning, and after that I need to test it.

Java surely not, as there is no Cygwin Java.

--with-threads=posix: yes

not tested yet
--enable-opal-multi-threads \
--enable-mpi-thread-multiple \





Kind regards

Siegmar



Regards
Marco




Re: [OMPI users] [Open MPI] #3351: JAVA scatter error

2012-12-19 Thread Siegmar Gross
Hi

I have shortened this email so that you get to my comments sooner.

> > In my opinion Datatype.Vector must set the size of the
> > base datatype as extent of the vector and not the true extent, because
> > MPI-Java doesn't provide a function to resize a datatype.
> 
> No, I think Datatype.Vector is doing the Right Thing in that it acts
> just like MPI_Type_vector.  We do want these to be *bindings*, after
> all -- meaning that they should be pretty much a 1:1 mapping to the
> C bindings.  
> 
> I think the real shortcoming is that there is no Datatype.Resized
> function.  That can be fixed.

Are you sure? That would at least solve one problem.
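
For comparison, in the C bindings I would build the column datatype roughly
as follows (a sketch for a row-major P x Q matrix of doubles; the helper
name is mine):

/* Sketch: describe one column of a row-major P x Q matrix of doubles and
 * resize it so that consecutive columns start one double apart.  This is
 * the C-level pattern that a Datatype.Resized binding would correspond to. */
#include <mpi.h>

static void build_column_type(int P, int Q, MPI_Datatype *column)
{
    MPI_Datatype vec;

    /* P blocks of 1 double, each Q doubles apart = one matrix column */
    MPI_Type_vector(P, 1, Q, MPI_DOUBLE, &vec);

    /* Without the resize the extent of "vec" spans almost the whole matrix,
     * so scattering consecutive columns would not address the right data. */
    MPI_Type_create_resized(vec, 0, (MPI_Aint)sizeof(double), column);
    MPI_Type_commit(column);
    MPI_Type_free(&vec);
}

With such a type, MPI_Scatter(matrix, 1, column, recvbuf, P, MPI_DOUBLE, 0,
comm) should give column i to process i, which is exactly what I cannot
express in the Java binding at the moment.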


> > We should forget
> > ObjectScatterMain.java for the moment and concentrate on
> > ObjectBroadcastMain.java, which I have sent three days ago to the list,
> > because it has the same problem.
> > 
> > 1) ColumnSendRecvMain.java
> > 
> > I create a 2D-matrix with (Java books would use "double[][] matrix"
> > which is the same in my opinion, but I like C notation)
> > 
> > double matrix[][] = new double[P][Q];
> 
> I noticed that if I used [][] in my version of the Scatter program,
> I got random results.  But if I used [] and did my own offset
> indexing, it worked.

I think if you want a 2D-matrix you should use a Java matrix and not
a special one with your own offset indexing. In my opinion that is
something a C programmer can/would do (I'm a C programmer myself with
a little Java knowledge), but the benefit of Java is that the programmer
should not know about addresses, memory layouts and similar things. Now
I sound like my colleagues who always claim that my Java programs look
more like C programs than Java programs :-(. I know nothing about the
memory layout of a Java matrix or if the layout is stable during the
lifetime of the object, but I think that the Java interface should deal
with all these things if that is possible. I suppose that Open MPI will
not succeed in the Java world if it requires "special" matrices and a
special offset indexing. Perhaps some members of this list have very
good Java knowledge or even know the exact layout of Java matrices so
that Datatype.Vector can build a Java column vector from a Java matrix
which even contains valid values.


> If double[][] is a fundamentally different type (and storage format)
> than double[], what is MPI to do?  How can it tell the difference?
> 
> > It is easy to see that process 1 doesn't get column 0. Your
> > suggestion to allocate enough memory for a matrix (without defining
> > a matrix) and doing all index computations yourself is in my opinion
> > not applicable for a "normal" Java programmer (it's even hard for
> > most C programmers :-) ). Hopefully you have an idea how to solve
> > this problem so that all processes receive correct column values.
> 
> I'm afraid I don't, other than defining your own class which
> allocates memory contiguously, but overrides [] and [][]
> (I'm *assuming* you can do that in Java...?).

Does anybody else in this list know how it can be done?


> > 2) ObjectBroadcastMain.java
> > 
> > As I said above, it is my understanding, that I can send a Java object
> > when I use MPI.OBJECT and that the MPI implementation must perform all
> > necessary tasks.
> 
> Remember: there is no standard for MPI and Java.  So there is no
> "must".  :-)

I know, and I'm grateful that you nevertheless try to offer a Java
interface. Hopefully you will not misunderstand my "must". It wasn't
meant as a complaint, but as a way of saying that a "normal" Java user
would expect to be able to implement an MPI program without special
knowledge about data layouts.


> This is one research implementation that was created.  We can update
> it and try to make it better, but we're somewhat crafting the rules
> as we go along here.
> 
> (BTW, if we continue detailed discussions about implementation,
> this conversation should probably move to the devel list...)
> 
> > Your interface for derived datatypes provides only
> > methods for discontiguous data and no method to create an MPI.OBJECT,
> > so that I have no idea what I would have to do to create one. The
> > object must be serializable so that you get the same values in a
> > heterogeneous environment. 
> > 
> > tyr java 146 mpiexec -np 2 java ObjectBroadcastMain
> > Exception in thread "main" java.lang.ClassCastException:
> >  MyData cannot be cast to [Ljava.lang.Object;
> >at mpi.Comm.Object_Serialize(Comm.java:207)
> >at mpi.Comm.Send(Comm.java:292)
> >at mpi.Intracomm.Bcast(Intracomm.java:202)
> >at ObjectBroadcastMain.main(ObjectBroadcastMain.java:44)
> > ...
> 
> After rooting around in the code a bit, I think I understand this
> stack trace a bit better now..
> 
> The code line in question is in the Object_Serialize method, where
> it calls:
> 
>   Object buf_els [] = (Object[])buf;
> 
> So it's trying to cast an (Object) to an (Object[]).  Apparently,
> this works for intrinsic Java types (e.g., int).  But it does

Re: [OMPI users] Infiniband errors

2012-12-19 Thread Shamis, Pavel
It seems the driver was not started. I would suggest running lspci and
checking whether the HCA is visible at the hardware level.

Pavel (Pasha) Shamis
---
Computer Science Research Group
Computer Science and Math Division
Oak Ridge National Laboratory






On Dec 19, 2012, at 2:12 AM, Syed Ahsan Ali wrote:

Dear John

I found this output of ibstatus on some nodes (most probably the ones
causing the problem):
[root@compute-01-08 ~]# ibstatus
Fatal error:  device '*': sys files not found (/sys/class/infiniband/*/ports)

Does this show any hardware or software issue?

Thanks




On Wed, Nov 28, 2012 at 3:17 PM, John Hearns 
mailto:hear...@googlemail.com>> wrote:

Those diagnostics are from Openfabrics.
What type of infiniband card do you have?
What drivers are you using?





Re: [OMPI users] Infiniband errors

2012-12-19 Thread Yann Droneaud
Le mercredi 19 décembre 2012 à 12:12 +0500, Syed Ahsan Ali a écrit :
> Dear John
>  
> I found this output of ibstatus on some nodes (most probably the ones
> causing the problem):
> [root@compute-01-08 ~]# ibstatus
> 
> Fatal error:  device '*': sys files not found
> (/sys/class/infiniband/*/ports)
>  
> Does this show any hardware or software issue?
>  

This is a software issue.

Which Linux (lsb_release --all or cat /etc/redhat-release) and kernel
(uname -a) version are you using?

Which modules are loaded (lsmod)?

Is /sys mounted (mount and/or cat /proc/mounts)?

Regards.

-- 
Yann Droneaud
OPTEYA




Re: [OMPI users] Possible memory error

2012-12-19 Thread Handerson, Steven
Jeff, others:

I fixed the problem we were experiencing by adding a barrier.
The bug occurred between a piece of code that uses many SENDs (from the
leader) and RECVs (in the worker processes), in a loop, to ship data from
the head/leader to the processing nodes, and I think what might have been
happening is that this communication was getting mixed up with the
following allreduce when there is no barrier.

The bug shows up in Valgrind and dmalloc as a read from freed memory.

I might spend some time trying to make a small piece of code that reproduces 
this,
but maybe this gives you some idea of what might be the issue,
if it's something that should be fixed.
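
In the meantime, here is a rough sketch of the shape of that code
(illustrative only; this is not our actual application, and the buffer
names and sizes are made up):

/* Sketch of the pattern described above: the leader ships data to each
 * worker with point-to-point messages, then everyone joins an allreduce.
 * The barrier marks where adding one made the problem go away for us. */
#include <mpi.h>

static void distribute_then_reduce(int *chunk, int chunk_len, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    if (rank == 0) {
        for (int worker = 1; worker < nprocs; ++worker)
            MPI_Send(chunk, chunk_len, MPI_INT, worker, 0, comm);
    } else {
        MPI_Recv(chunk, chunk_len, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
    }

    MPI_Barrier(comm);   /* workaround: without this, the read-from-freed
                            report appeared around the allreduce below */

    int local_min = chunk[0], global_min;
    MPI_Allreduce(&local_min, &global_min, 1, MPI_INT, MPI_MIN, comm);
    (void)global_min;
}
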
Some more info: it happens even as far back as openMPI 1.3.4, and even in the 
newest 1.6.3.

Steve



-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Jeff Squyres
Sent: Saturday, December 15, 2012 7:34 AM
To: Open MPI Users
Subject: Re: [OMPI users] Possible memory error

On Dec 14, 2012, at 4:31 PM, Handerson, Steven wrote:

> I'm trying to track down an instance of openMPI writing to a freed block of 
> memory.
> This occurs with the most recent release (1.6.3) as well as 1.6, on a 64 bit 
> intel architecture, fedora 14.
> It occurs with a very simple reduction (allreduce minimum), over a single int 
> value.

Can you send a reproducer program?  The simpler, the better.

> I'm wondering if the openMPI developers use power tools such as 
> valgrind / dmalloc / etc on the releases to try to catch these things 
> via exhaustive testing - but I understand memory problems in C are of 
the nature that anyone making a mistake can propagate, so I haven't ruled out 
> problems in our own code.
> Also, I'm wondering if anyone has suggestions on how to track this down 
> further.

Yes, we do use such tools.

Can you cite the specific file/line where the problem is occurring?  The
allreduce algorithms are fairly self-contained; it should be (relatively) 
straightforward to examine that code and see if there's a problem with the 
memory allocation there.

> I'm using allinea DDT and their builtin dmalloc, which catches the 
> error, which appears in the second memcpy in  opal_convertor_pack(), but I 
> don't have more details than that at the moment.
> All I know so far is that one of those values has been freed.
> Obviously, I haven't seen anything in earlier parts of the code which 
> might have triggered memory corruption, although both openMPI and intel IPP 
> do things with uninitialized values before this (according to Valgrind).

There's a number of issues that can lead to false positives for using 
uninitialized values.  Here's two of the most common cases:

1. When using TCP, one of our data headers has a padding hole in it, but we 
write the whole struct down a TCP socket file descriptor anyway. Hence, it 
will generate a "read from uninit" warning (a toy sketch of this case follows 
after the list).

2. When using OpenFabrics-based networks, tools like valgrind don't see the 
OS-bypass initialization of the memory (which frequently comes directly from 
the hardware), and this generates a lot of false "read from uninit" positives.
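
A toy illustration of case 1 (this is not Open MPI's actual header, just
the shape of the problem):

/* Toy example only: not Open MPI's actual header layout.  The compiler
 * inserts padding between "type" and "len"; those padding bytes are never
 * written, so writing the whole struct to a socket makes valgrind complain
 * about uninitialised bytes even though the peer never reads them. */
#include <stdint.h>
#include <unistd.h>

struct toy_header {
    uint8_t  type;   /* 1 byte, then ~7 bytes of padding on 64-bit ABIs */
    uint64_t len;    /* 8 bytes */
};

static void send_header(int sockfd, uint8_t type, uint64_t len)
{
    struct toy_header hdr;
    hdr.type = type;
    hdr.len  = len;
    write(sockfd, &hdr, sizeof(hdr));   /* flagged: padding bytes are uninit */
}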

One thing you can try is to compile Open MPI --with-valgrind.  This adds a 
little performance penalty, but we take extra steps to eliminate most false 
positives.  It could help separate the wheat from the chaff, in your case.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/





Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

2012-12-19 Thread Number Cruncher

On 19/12/12 11:08, Paul Kapinos wrote:
Did you *really* want to dig into the code just to switch a default
communication algorithm?


No, I didn't want to, but with a huge change in performance, I'm forced 
to do something! And having looked at the different algorithms, I think 
there's a problem with the new default whenever message sizes are small 
enough that connection latency will dominate. We're not all running 
Infiniband, and having to wait for each pairwise exchange to complete 
before initiating another seems wrong if the latency in waiting for 
completion dominates the transmission time.


E.g. if I have 10 small pairwise exchanges to perform, isn't it better to
put all 10 outbound messages on the wire and wait for the 10 matching
inbound messages, in any order? The new algorithm must wait for the first
exchange to complete, then the second, then the third. Unlike before,
doesn't it also have to wait to acknowledge the matching *zero-sized*
requests? I don't see why this temporal ordering matters.


Thanks for your help,
Simon






Note there are several ways to set the parameters; --mca on command 
line is just one of them (suitable for quick online tests).


http://www.open-mpi.org/faq/?category=tuning#setting-mca-params

We 'tune' our Open MPI by setting environment variables.

Best
Paul Kapinos



On 12/19/12 11:44, Number Cruncher wrote:

Having run some more benchmarks, the new default is *really* bad for our
application (2-10x slower), so I've been looking at the source to try 
and figure

out why.

It seems that the biggest difference will occur when the all_to_all 
is actually
sparse (e.g. our application); if most N-M process exchanges are zero 
in size
the 1.6 ompi_coll_tuned_alltoallv_intra_basic_linear algorithm will 
actually
only post irecv/isend for non-zero exchanges; any zero-size exchanges 
are
skipped. It then waits once for all requests to complete. In 
contrast, the new
ompi_coll_tuned_alltoallv_intra_pairwise will post the zero-size 
exchanges for
*every* N-M pair, and wait for each pairwise exchange. This is 
O(comm_size)
waits, many of which are zero. I'm not clear what optimizations there 
are for
zero-size isend/irecv, but surely there's a great deal more latency 
if each
pairwise exchange has to be confirmed complete before executing the 
next?


Relatedly, how would I direct OpenMPI to use the older algorithm
programmatically? I don't want the user to have to use "--mca" in their
"mpiexec". Is there a C API?

Thanks,
Simon


On 16/11/12 10:15, Iliev, Hristo wrote:

Hi, Simon,

The pairwise algorithm passes messages in a synchronised ring-like 
fashion

with increasing stride, so it works best when independent communication
paths could be established between several ports of the network
switch/router. Some 1 Gbps Ethernet equipment is not capable of 
doing so,
some is - it depends (usually on the price). This said, not all 
algorithms
perform the same given a specific type of network interconnect. For 
example,
on our fat-tree InfiniBand network the pairwise algorithm performs 
better.


You can switch back to the basic linear algorithm by providing the 
following

MCA parameters:

mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca
coll_tuned_alltoallv_algorithm 1 ...

Algorithm 1 is the basic linear, which used to be the default. 
Algorithm 2

is the pairwise one.
You can also set these values as exported environment variables:

export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
mpiexec ...

You can also put this in $HOME/.openmpi/mcaparams.conf or (to make 
it have

global effect) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:

coll_tuned_use_dynamic_rules=1
coll_tuned_alltoallv_algorithm=1

A gratuitous hint: dual-Opteron systems are NUMAs so it makes sense to
activate process binding with --bind-to-core if you haven't already 
done so.

It prevents MPI processes from being migrated to other NUMA nodes while
running.

Kind regards,
Hristo
--
Hristo Iliev, Ph.D. -- High Performance Computing
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)



-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
On Behalf Of Number Cruncher
Sent: Thursday, November 15, 2012 5:37 PM
To: Open MPI Users
Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 
1.6.1


I've noticed a very significant (100%) slow down for MPI_Alltoallv calls
as of version 1.6.1.
* This is most noticeable for high-frequency exchanges over 1Gb ethernet
where process-to-process message sizes are fairly small (e.g. 100kbyte)
and much of the exchange matrix is sparse.
* 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default algorithm
to a pairwise exchange", but I'm not clear what this means or how to
switch back to the old "non-default algorithm".

I attach a test program which illustrate

Re: [OMPI users] openmpi-1.9a1r27674 on Cygwin-1.7.17

2012-12-19 Thread marco atzeri

On 12/19/2012 12:28 PM, marco atzeri wrote:


working on openmpi-1.7rc5.
It needs some cleaning and after I need to test.


built and passed test
http://www.open-mpi.org/community/lists/devel/2012/12/11855.php

Regards
Marco



Re: [OMPI users] mpi problems/many cpus per node

2012-12-19 Thread Daniel Davidson

I figured this out.

ssh was working, but scp was not, due to an MTU mismatch between the
systems.  Adding MTU=1500 to my
/etc/sysconfig/network-scripts/ifcfg-eth2 fixed the problem.


Dan

On 12/17/2012 04:12 PM, Daniel Davidson wrote:

Yes, it does.

Dan

[root@compute-2-1 ~]# ssh compute-2-0
Warning: untrusted X11 forwarding setup failed: xauth key data not 
generated
Warning: No xauth data; using fake authentication data for X11 
forwarding.

Last login: Mon Dec 17 16:13:00 2012 from compute-2-1.local
[root@compute-2-0 ~]# ssh compute-2-1
Warning: untrusted X11 forwarding setup failed: xauth key data not 
generated
Warning: No xauth data; using fake authentication data for X11 
forwarding.

Last login: Mon Dec 17 16:12:32 2012 from biocluster.local
[root@compute-2-1 ~]#



On 12/17/2012 03:39 PM, Doug Reeder wrote:

Daniel,

Does passwordless ssh work. You need to make sure that it is.

Doug
On Dec 17, 2012, at 2:24 PM, Daniel Davidson wrote:

I would also add that scp seems to be creating the file in the /tmp 
directory of compute-2-0, and that /var/log secure is showing ssh 
connections being accepted.  Is there anything in ssh that can limit 
connections that I need to look out for?  My guess is that it is 
part of the client prefs and not the server prefs since I can 
initiate the mpi command from another machine and it works fine, 
even when it uses compute-2-0 and 1.


Dan


[root@compute-2-1 /]# date
Mon Dec 17 15:11:50 CST 2012
[root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
compute-2-0,compute-2-1 -v  -np 10 --leave-session-attached -mca 
odls_base_verbose 5 -mca plm_base_verbose 5 hostname
[compute-2-1.local:70237] mca:base:select:(  plm) Querying component 
[rsh]
[compute-2-1.local:70237] [[INVALID],INVALID] plm:rsh_lookup on 
agent ssh : rsh path NULL


[root@compute-2-0 tmp]# ls -ltr
total 24
-rw---.  1 rootroot   0 Nov 28 08:42 yum.log
-rw---.  1 rootroot5962 Nov 29 10:50 
yum_save_tx-2012-11-29-10-50SRba9s.yumtx
drwx--.  3 danield danield 4096 Dec 12 14:56 
openmpi-sessions-danield@compute-2-0_0
drwx--.  3 rootroot4096 Dec 13 15:38 
openmpi-sessions-root@compute-2-0_0
drwx--  18 danield danield 4096 Dec 14 09:48 
openmpi-sessions-danield@compute-2-0.local_0
drwx--  44 rootroot4096 Dec 17 15:14 
openmpi-sessions-root@compute-2-0.local_0


[root@compute-2-0 tmp]# tail -10 /var/log/secure
Dec 17 15:13:40 compute-2-0 sshd[24834]: Accepted publickey for root 
from 10.1.255.226 port 49483 ssh2
Dec 17 15:13:40 compute-2-0 sshd[24834]: pam_unix(sshd:session): 
session opened for user root by (uid=0)
Dec 17 15:13:42 compute-2-0 sshd[24834]: Received disconnect from 
10.1.255.226: 11: disconnected by user
Dec 17 15:13:42 compute-2-0 sshd[24834]: pam_unix(sshd:session): 
session closed for user root
Dec 17 15:13:50 compute-2-0 sshd[24851]: Accepted publickey for root 
from 10.1.255.226 port 49484 ssh2
Dec 17 15:13:50 compute-2-0 sshd[24851]: pam_unix(sshd:session): 
session opened for user root by (uid=0)
Dec 17 15:13:55 compute-2-0 sshd[24851]: Received disconnect from 
10.1.255.226: 11: disconnected by user
Dec 17 15:13:55 compute-2-0 sshd[24851]: pam_unix(sshd:session): 
session closed for user root
Dec 17 15:14:01 compute-2-0 sshd[24868]: Accepted publickey for root 
from 10.1.255.226 port 49485 ssh2
Dec 17 15:14:01 compute-2-0 sshd[24868]: pam_unix(sshd:session): 
session opened for user root by (uid=0)







On 12/17/2012 11:16 AM, Daniel Davidson wrote:
After a very long time (15 minutes or so), I finally received the 
following in addition to what I just sent earlier:


[compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc 
working on WILDCARD
[compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc 
working on WILDCARD
[compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc 
working on WILDCARD

[compute-2-1.local:69655] [[32341,0],0] daemon 1 failed with status 1
[compute-2-1.local:69655] [[32341,0],0] plm:base:orted_cmd sending 
orted_exit commands
[compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc 
working on WILDCARD
[compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc 
working on WILDCARD


Firewalls are down:

[root@compute-2-1 /]# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source   destination

Chain FORWARD (policy ACCEPT)
target prot opt source   destination

Chain OUTPUT (policy ACCEPT)
target prot opt source   destination
[root@compute-2-0 ~]# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source   destination

Chain FORWARD (policy ACCEPT)
target prot opt source   destination

Chain OUTPUT (policy ACCEPT)
target prot opt source   destination

On 12/17/2012 11:09 AM, Ralph Castain wrote:
Hmmm...and that is ALL the output? If so, then it never succeeded 
in sending a message back, which leads one to suspect some kind of 
firewall in the way.


Looking at the ssh

Re: [OMPI users] mpi problems/many cpus per node

2012-12-19 Thread Ralph Castain
Hooray!! Great to hear - I was running out of ideas :-)

On Dec 19, 2012, at 2:01 PM, Daniel Davidson  wrote:

> I figured this out.
> 
> ssh was working, but scp was not due to an mtu mismatch between the systems.  
> Adding MTU=1500 to my /etc/sysconfig/network-scripts/ifcfg-eth2 fixed the 
> problem.
> 
> Dan
> 
> On 12/17/2012 04:12 PM, Daniel Davidson wrote:
>> Yes, it does.
>> 
>> Dan
>> 
>> [root@compute-2-1 ~]# ssh compute-2-0
>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>> Last login: Mon Dec 17 16:13:00 2012 from compute-2-1.local
>> [root@compute-2-0 ~]# ssh compute-2-1
>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>> Last login: Mon Dec 17 16:12:32 2012 from biocluster.local
>> [root@compute-2-1 ~]#
>> 
>> 
>> 
>> On 12/17/2012 03:39 PM, Doug Reeder wrote:
>>> Daniel,
>>> 
>>> Does passwordless ssh work. You need to make sure that it is.
>>> 
>>> Doug
>>> On Dec 17, 2012, at 2:24 PM, Daniel Davidson wrote:
>>> 
 I would also add that scp seems to be creating the file in the /tmp 
 directory of compute-2-0, and that /var/log secure is showing ssh 
 connections being accepted.  Is there anything in ssh that can limit 
 connections that I need to look out for?  My guess is that it is part of 
 the client prefs and not the server prefs since I can initiate the mpi 
 command from another machine and it works fine, even when it uses 
 compute-2-0 and 1.
 
 Dan
 
 
 [root@compute-2-1 /]# date
 Mon Dec 17 15:11:50 CST 2012
 [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
 compute-2-0,compute-2-1 -v  -np 10 --leave-session-attached -mca 
 odls_base_verbose 5 -mca plm_base_verbose 5 hostname
 [compute-2-1.local:70237] mca:base:select:(  plm) Querying component [rsh]
 [compute-2-1.local:70237] [[INVALID],INVALID] plm:rsh_lookup on agent ssh 
 : rsh path NULL
 
 [root@compute-2-0 tmp]# ls -ltr
 total 24
 -rw---.  1 rootroot   0 Nov 28 08:42 yum.log
 -rw---.  1 rootroot5962 Nov 29 10:50 
 yum_save_tx-2012-11-29-10-50SRba9s.yumtx
 drwx--.  3 danield danield 4096 Dec 12 14:56 
 openmpi-sessions-danield@compute-2-0_0
 drwx--.  3 rootroot4096 Dec 13 15:38 
 openmpi-sessions-root@compute-2-0_0
 drwx--  18 danield danield 4096 Dec 14 09:48 
 openmpi-sessions-danield@compute-2-0.local_0
 drwx--  44 rootroot4096 Dec 17 15:14 
 openmpi-sessions-root@compute-2-0.local_0
 
 [root@compute-2-0 tmp]# tail -10 /var/log/secure
 Dec 17 15:13:40 compute-2-0 sshd[24834]: Accepted publickey for root from 
 10.1.255.226 port 49483 ssh2
 Dec 17 15:13:40 compute-2-0 sshd[24834]: pam_unix(sshd:session): session 
 opened for user root by (uid=0)
 Dec 17 15:13:42 compute-2-0 sshd[24834]: Received disconnect from 
 10.1.255.226: 11: disconnected by user
 Dec 17 15:13:42 compute-2-0 sshd[24834]: pam_unix(sshd:session): session 
 closed for user root
 Dec 17 15:13:50 compute-2-0 sshd[24851]: Accepted publickey for root from 
 10.1.255.226 port 49484 ssh2
 Dec 17 15:13:50 compute-2-0 sshd[24851]: pam_unix(sshd:session): session 
 opened for user root by (uid=0)
 Dec 17 15:13:55 compute-2-0 sshd[24851]: Received disconnect from 
 10.1.255.226: 11: disconnected by user
 Dec 17 15:13:55 compute-2-0 sshd[24851]: pam_unix(sshd:session): session 
 closed for user root
 Dec 17 15:14:01 compute-2-0 sshd[24868]: Accepted publickey for root from 
 10.1.255.226 port 49485 ssh2
 Dec 17 15:14:01 compute-2-0 sshd[24868]: pam_unix(sshd:session): session 
 opened for user root by (uid=0)
 
 
 
 
 
 
 On 12/17/2012 11:16 AM, Daniel Davidson wrote:
> After a very long time (15 minutes or so), I finally received the following in 
> addition to what I just sent earlier:
> 
> [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on 
> WILDCARD
> [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on 
> WILDCARD
> [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on 
> WILDCARD
> [compute-2-1.local:69655] [[32341,0],0] daemon 1 failed with status 1
> [compute-2-1.local:69655] [[32341,0],0] plm:base:orted_cmd sending 
> orted_exit commands
> [compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on 
> WILDCARD
> [compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on 
> WILDCARD
> 
> Firewalls are down:
> 
> [root@compute-2-1 /]# iptables -L
> Chain INPUT (policy ACCEPT)
> target prot opt source   destination
> 
> Chain FORWARD (policy ACCE

[OMPI users] 1.6.2 affinity failures

2012-12-19 Thread Brock Palen
Using Open MPI 1.6.2 with Intel 13.0, though the problem is not specific to
the compiler.

Using two 12-core, 2-socket nodes:

mpirun -np 4 -npersocket 2 uptime
--
Your job has requested a conflicting number of processes for the
application:

App: uptime
number of procs:  4

This is more processes than we can launch under the following
additional directives and conditions:

number of sockets:   0
npersocket:   2


Any idea why this wouldn't work?  

Another problem: the following does what I expect on two 2-socket nodes
with 8-core sockets (16 total cores/node).

mpirun -np 8 -npernode 4 -bind-to-core -cpus-per-rank 4 hwloc-bind --get
0x000f
0x000f
0x00f0
0x00f0
0x0f00
0x0f00
0xf000
0xf000

But fails at large scale:

mpirun -np 276 -npernode 4 -bind-to-core -cpus-per-rank 4 hwloc-bind --get

--
An invalid physical processor ID was returned when attempting to bind
an MPI process to a unique processor.

This usually means that you requested binding to more processors than
exist (e.g., trying to bind N MPI processes to M processors, where N >
M).  Double check that you have enough unique processors for all the
MPI processes that you are launching on this host.
You job will now abort.
--



Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985






Re: [OMPI users] 1.6.2 affinity failures

2012-12-19 Thread Ralph Castain
I'm afraid these are both known problems in the 1.6.2 release. I believe we 
fixed npersocket in 1.6.3, though you might check to be sure. On the 
large-scale issue, cpus-per-rank might well fail under those conditions. The 
algorithm in the 1.6 series hasn't seen much use, especially at scale.

In fact, cpus-per-rank has somewhat fallen by the wayside recently due to 
apparent lack of interest. I'm restoring it for the 1.7 series over the holiday 
(currently doesn't work in 1.7 or trunk).


On Dec 19, 2012, at 4:34 PM, Brock Palen  wrote:

> Using Open MPI 1.6.2 with Intel 13.0, though the problem is not specific to 
> the compiler.
> 
> Using two 12 core 2 socket nodes, 
> 
> mpirun -np 4 -npersocket 2 uptime
> --
> Your job has requested a conflicting number of processes for the
> application:
> 
> App: uptime
> number of procs:  4
> 
> This is more processes than we can launch under the following
> additional directives and conditions:
> 
> number of sockets:   0
> npersocket:   2
> 
> 
> Any idea why this wouldn't work?  
> 
> Another problem: the following does what I expect on two 2-socket nodes 
> with 8-core sockets (16 total cores/node).
> 
> mpirun -np 8 -npernode 4 -bind-to-core -cpus-per-rank 4 hwloc-bind --get
> 0x000f
> 0x000f
> 0x00f0
> 0x00f0
> 0x0f00
> 0x0f00
> 0xf000
> 0xf000
> 
> But fails at large scale:
> 
> mpirun -np 276 -npernode 4 -bind-to-core -cpus-per-rank 4 hwloc-bind --get
> 
> --
> An invalid physical processor ID was returned when attempting to bind
> an MPI process to a unique processor.
> 
> This usually means that you requested binding to more processors than
> exist (e.g., trying to bind N MPI processes to M processors, where N >
> M).  Double check that you have enough unique processors for all the
> MPI processes that you are launching on this host.
> You job will now abort.
> --
> 
> 
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> 