[OMPI users] Bug when mixing sent types in version 1.6

2012-06-08 Thread BOUVIER Benjamin
Hi everybody,

I am currently hitting a bug when launching a very simple MPI program with mpirun, on 
connected nodes. It happens when I send an INT and then some CHAR strings 
from a master node to a worker node. 
Here is the minimal code to reproduce the bug :


#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    const char someString[] = "Can haz cheezburgerz?";

    MPI_Init(&argc, &argv);

    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    if ( rank == 0 )
    {
        int len = strlen( someString );
        int i;
        for( i = 1; i < size; ++i )
        {
            MPI_Send( &len, 1, MPI_INT, i, 0, MPI_COMM_WORLD );
            MPI_Send( someString, len+1, MPI_CHAR, i, 0, MPI_COMM_WORLD );
        }
    } else {
        char buffer[ 128 ];
        int receivedLen;
        MPI_Status stat;
        MPI_Recv( &receivedLen, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &stat );
        printf( "[Worker] Length : %d\n", receivedLen );
        MPI_Recv( buffer, receivedLen+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat );
        printf( "[Worker] String : %s\n", buffer );
    }

    MPI_Finalize();
    return 0;
}



I know that there is a better way to receive a string, by giving a maximum buffer 
size to the second MPI_Recv, but that is not the main topic here.
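For what it's worth, the usual way to receive a string without sending its length in a separate message is MPI_Probe plus MPI_Get_count; here is a minimal sketch of the worker side (same source rank, tag, and communicator as in the program above; headers and error handling omitted, and it cannot run outside an MPI job):

```c
/* Worker side: receive a string of unknown length without a prior
   length message. */
MPI_Status stat;
int count;

/* Wait until a message from rank 0 with tag 0 is pending,
   without actually receiving it yet. */
MPI_Probe( 0, 0, MPI_COMM_WORLD, &stat );

/* Ask how many MPI_CHAR elements the pending message holds
   (including the terminating '\0' sent with len+1 above). */
MPI_Get_count( &stat, MPI_CHAR, &count );

char *buffer = malloc( count );   /* needs <stdlib.h> */
MPI_Recv( buffer, count, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat );
printf( "[Worker] String : %s\n", buffer );
free( buffer );
```

This removes the fixed 128-byte buffer limit, but it does not change the point-to-point blocking behaviour, so it is unrelated to the hang itself.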
The launch works locally (i.e. when the 2 processes are launched on one 
machine), but doesn't work when the 2 processes are dispatched to 2 machines 
over the network (i.e. one per host). In that case, the worker correctly reads 
the INT, and then master and worker block on the next call.
I have no issue when sending only char strings or only numbers. The hang only 
happens when sending char strings then numbers, or in the other order.

I'm using OpenMPI version 1.6, locally compiled. 
$ uname -a
Linux trtp7097 2.6.32-220.13.1.el6.x86_64 #1 SMP Thu Mar 29 11:46:40 EDT 2012 
x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/redhat-release 
Red Hat Enterprise Linux Workstation release 6.2 (Santiago)

Is this a bad use of the framework, or could it be a bug?

Thank you in advance.
Benjamin


[OMPI users] RE : Bug when mixing sent types in version 1.6

2012-06-08 Thread BOUVIER Benjamin
Hi Jeff,

Thanks for your answer.

I have downloaded the NetPIPE benchmark suite, run `make mpi`, and launched the 
resulting executable with mpirun.

Here is an interesting fact: launching this executable on 2 nodes works; on 3 
nodes, it blocks, I guess on connect. 
Each process runs on a core, on each machine, using 100% of one CPU, but 
nothing else happens. I have to kill the program to quit. 
Setting the option -mca btl_base_verbose to 30 shows that the last thing each 
node tries is to connect to the other nodes.

Could it be a network issue? 

Thanks,
--
Benjamin Bouvier


From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of Jeff 
Squyres [jsquy...@cisco.com]
Sent: Friday, June 8, 2012 16:30
To: Open MPI Users
Subject: Re: [OMPI users] Bug when mixing sent types in version 1.6

On Jun 8, 2012, at 6:43 AM, BOUVIER Benjamin wrote:

> #include <stdio.h>
> #include <string.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>int rank, size;
>const char someString[] = "Can haz cheezburgerz?";
>
>MPI_Init(&argc, &argv);
>
>MPI_Comm_rank( MPI_COMM_WORLD, & rank );
>MPI_Comm_size( MPI_COMM_WORLD, & size );
>
>if ( rank == 0 )
>{
>int len = strlen( someString );
>int i;
>for( i = 1; i < size; ++i)
>{
>MPI_Send( &len, 1, MPI_INT, i, 0, MPI_COMM_WORLD );
>MPI_Send( &someString, len+1, MPI_CHAR, i, 0, MPI_COMM_WORLD );
>}
>} else {
>char buffer[ 128 ];
>int receivedLen;
>MPI_Status stat;
>MPI_Recv( &receivedLen, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &stat );
>printf( "[Worker] Length : %d\n", receivedLen );
>MPI_Recv( buffer, receivedLen+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, 
> &stat);
>printf( "[Worker] String : %s\n", buffer );
>}
>
>MPI_Finalize();
> }

I don't see anything obviously wrong with this code.

> I know that there is a better way to receive a string, by giving a maximum 
> buffer size to the second MPI_Recv, but that is not the main topic here.
> The launch works locally (i.e. when the 2 processes are launched on one 
> machine), but doesn't work when the 2 processes are dispatched to 2 machines 
> over the network (i.e. one per host). In that case, the worker correctly reads 
> the INT, and then master and worker block on the next call.

That's very odd.

> I have no issue when sending only char strings or only numbers. This only 
> happens when sending char strings then numbers, or in the other order.

That's even more odd.

Can you run standard benchmarks like MPI net pipe, and/or the OSU benchmarks?  
(across multiple nodes, that is)

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



[OMPI users] RE : RE : Bug when mixing sent types in version 1.6

2012-06-11 Thread BOUVIER Benjamin
Hi,

> I'd guess that running net pipe with 3 procs may be undefined.

It is indeed undefined. Running the NetPIPE program locally with 3 processes 
blocks on my computer.

This issue is especially weird because there is no problem running the example 
program over the network with the MPICH2 implementation, for 2 processes.

However, with MPICH2 it also fails with 3 processes and blocks on connect 
("Connection refused"), which could indicate that it is actually a network 
issue affecting both MPICH2 and OMPI. I don't know how many connections OMPI 
uses to send the data in the example program, but under the assumption that it 
tries to open 2 connections (while, for the same program, MPICH2 only uses one 
connection, which is another hypothesis), the number of connections may be the 
right place to look. I'll ask the MPICH2 users on their mailing list to get 
their opinion about it.

Now that I know the program fails with both the OMPI and MPICH2 
implementations, I guess the problem does not depend on the MPI implementation.

If you have any ideas or comments, I would be pleased to hear them.

--
Benjamin Bouvier



[OMPI users] RE : RE : RE : Bug when mixing sent types in version 1.6

2012-06-11 Thread BOUVIER Benjamin
Hi,

Thanks for your hints Jeff.
I've just tried with all firewalls disabled on the machines involved, but the 
issue remains.

# /etc/init.d/ip6tables status
ip6tables: Firewall is not running.
# /etc/init.d/iptables status
iptables: Firewall is not running.

The machines have the host names "node1", "node2" and "node3".
I launch the basic program on one machine, asking for node1 and node2 as hosts. 
Typing `netstat -a | grep node1` on node2 shows that node1 and node2 are 
connected over TCP, as the connection is marked ESTABLISHED. I see the same 
thing when I run `netstat -a | grep node2` on node1. However, the program 
still blocks.

What else could cause that failure?
--
Benjamin BOUVIER 


To start, I would ensure that all firewalling (e.g., iptables) is disabled on 
all machines involved.

On Jun 11, 2012, at 10:16 AM, BOUVIER Benjamin wrote:

> Hi,
>
>> I'd guess that running net pipe with 3 procs may be undefined.
>
> It is indeed undefined. Running the net pipe program locally with 3 
> processors blocks, on my computer.
>
> This issue is especially weird as there is no problem for running the example 
> program on network with MPICH2 implementation, for 2 processes.
>
> However, with MPICH2, it fails with 3 processes and blocks also on connect 
> ("Connection refused"), which could indicate that it's actually a network 
> issue, with both MPICH2 and OMPI. I don't know how many connections OMPI use 
> to send the data in the example program, but with the assumption that it 
> tries to open 2 connections (while for the same program, MPICH2 only uses one 
> connection, which is another hypothesis), maybe the number of connections is 
> the right way to look for. I'll ask MPICH2 users on their mailing list, so as 
> to get their opinion about it.
>
> Now that I know the program doesn't work both with OMPI and MPICH2 
> implementations, I guess it's not dependant of MPI implementation.
>
> If you have any ideas or comments, I would be pleased to hear them.
>
> --
> Benjamin Bouvier
>


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/





[OMPI users] RE : RE : RE : RE : Bug when mixing sent types in version 1.6

2012-06-11 Thread BOUVIER Benjamin
Wow. I thought at first that all combinations would be equivalent, but in fact 
this is not the case...
I've kept the firewalls down during all the tests.

> - on node1, "mpirun --host node1,node2 ring_c"
Works.

> - on node1, "mpirun --host node1,node3 ring_c"
> - on node1, "mpirun --host node2,node3 ring_c"
Blocks after "Process 0 sent to 1".

> - on node1, "mpirun --host node1,node2,node3 ring_c"
"Process 0 sending 10 to 1, tag 201 (3 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9" then blocks

> Repeat all 4 from node2.
On node2: 
- "mpirun --host node2,node1 ring_c" : OK
- "mpirun --host node2,node3 ring_c" : blocks at the same point as above.
- "mpirun --host node1,node3 ring_c" : blocks at the same point as above.
- "mpirun --host node1,node2,node3 ring_c" : blocks at the same point as 
mentioned above for the 3-host case.

I recompiled this test program with MPICH2 and got exactly the same failures at 
the same points. 
There is really something wrong with that network...
--
Benjamin Bouvier



[OMPI users] RE : RE : RE : RE : RE : Bug when mixing sent types in version 1.6

2012-06-12 Thread BOUVIER Benjamin
Hi,

I've found, with ifconfig, that each node has 2 interfaces, eth0 and eth1. I've 
run mpiexec with the parameter --mca btl_tcp_if_include eth0 (or eth1) to see 
whether there were issues between specific nodes. Here are the results :
- node1,node2 works with eth1, not with eth0.
- node1,node3 works with eth1, not with eth0.
- node2,node3 does not work with eth1, but works with eth0.
- node1,node2,node3 works with eth1 (!), not with eth0.
These tests even work with the firewalls enabled.

Actually, the order of the nodes matters: `mpiexec --mca btl_tcp_if_include 
eth0 --host node1,node2 ./ring_c` does not work, but `mpiexec --mca 
btl_tcp_if_include eth0 --host node2,node1 ./ring_c` works. The same thing 
happens if I change the order when launching the 3 processes (putting node2 in 
first position). I find that a little disturbing, but I guess the network 
configuration is to blame.

Thanks a lot, Jeff Squyres; your hints helped me find the source of the 
problem. As is often the case, the problem didn't come from Open MPI but from 
the network configuration.
I'll ask my sysadmin to help me configure the interfaces, so that it works 
without setting the MCA parameter.
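In the meantime, the working interface can also be pinned in Open MPI's per-user MCA parameter file instead of on every command line; a sketch, assuming the standard location of that file (the value eth1 is simply the interface that happened to work here):

```
# $HOME/.openmpi/mca-params.conf
# Equivalent to passing "--mca btl_tcp_if_include eth1" to mpiexec.
btl_tcp_if_include = eth1
```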

Thank you one more time.
--
Benjamin Bouvier


> What's the output from ifconfig on all nodes?
>
>--
>Jeff Squyres
>jsquy...@cisco.com
>For corporate legal information go to: 
>http://www.cisco.com/web/about/doing_business/legal/cri/