[OMPI users] Communications problems w/OpenMPI

2008-12-18 Thread deadchic...@gmail.com

I've been trying to get OpenMPI to work on Amazon's EC2 but I've been
running into a communications problem. Here is the source (typical
Hello, World):



#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int myid, numprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);  /* total number of ranks */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);      /* rank of this process */

    printf("%d of %d: Hello world!\n", myid, numprocs);

    MPI_Finalize();
    return 0;
}



After compiling it, I copied it over to the other machine and tried
running it with:

mpirun -v --mca btl self,tcp -np 4 --machinefile machines /mnt/mpihw

which produces:

--
Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
--
Process 0.1.3 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
[domU-12-31-39-02-F5-13:03965] [0,0,0]-[0,1,0] mca_oob_tcp_msg_recv:
readv failed: Connection reset by peer (104)
[domU-12-31-39-02-F5-13:03965] [0,0,0]-[0,1,2] mca_oob_tcp_msg_recv:
readv failed: Connection reset by peer (104)
mpirun noticed that job rank 0 with PID 3653 on node
domU-12-31-39-00-B2-23 exited on signal 15 (Terminated).
1 additional process aborted (not shown)



AFAIK, the machines are able to communicate with each other on any port
you like, just not with MPI. Any idea what's wrong?




Re: [OMPI users] Communications problems w/OpenMPI

2008-12-18 Thread deadchic...@gmail.com

Jeroen Kleijer wrote:

The stable branch (1.2.x) works perfectly, but _only_ when the
communication channels between the machines (Ethernet) are in the
same subnet.
Since you don't have much control over which subnet your machines
end up in, OpenMPI has a tendency to fail on Amazon's EC2.

However, if you're able to compile and use a version of the
development branch (1.3), you should be able to compile and run the
"hello world" program without problems, regardless of the subnet the
machines are in.
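
To make the same-subnet condition above concrete, here is a minimal sketch in C that checks whether two IPv4 addresses share a network prefix. The function name same_ipv4_subnet and the prefix-length parameter are hypothetical and chosen for illustration only. It does not reproduce Open MPI's internal reachability logic; it only shows the kind of comparison that "same subnet" implies.

```c
#include <arpa/inet.h>   /* inet_pton, htonl */
#include <assert.h>
#include <netinet/in.h>  /* struct in_addr */
#include <stdint.h>
#include <sys/socket.h>  /* AF_INET */

/*
 * Hypothetical helper: returns 1 if the two dotted-quad IPv4 addresses
 * fall in the same subnet for the given prefix length (0..32), 0 if they
 * do not, and -1 if either string is not a valid IPv4 address.
 */
int same_ipv4_subnet(const char *a, const char *b, int prefix_len)
{
    struct in_addr addr_a, addr_b;
    uint32_t mask;

    assert(prefix_len >= 0 && prefix_len <= 32);

    if (inet_pton(AF_INET, a, &addr_a) != 1 ||
        inet_pton(AF_INET, b, &addr_b) != 1)
        return -1;

    /* Build the netmask in host order; a 32-bit shift by 32 is undefined,
       so a prefix length of 0 is handled separately. */
    mask = (prefix_len == 0) ? 0u : (0xFFFFFFFFu << (32 - prefix_len));

    /* s_addr is stored in network byte order, so convert the mask too. */
    mask = htonl(mask);

    return (addr_a.s_addr & mask) == (addr_b.s_addr & mask);
}
```

For example, same_ipv4_subnet("10.0.1.5", "10.0.2.5", 24) returns 0, while the same pair with a prefix length of 16 returns 1. The addresses here are made up; on real instances the address and netmask of each node would have to be read from its network configuration.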


I was hoping to avoid something like that (I originally used apt-get to 
install OpenMPI) but I guess I have little choice. We'll see how that goes.


In any case, thank you for the response and solution.


Re: [OMPI users] Communications problems w/OpenMPI

2008-12-18 Thread deadchic...@gmail.com

Jeroen Kleijer wrote:

However, if you're able to compile and use a version of the
development branch (1.3), you should be able to compile and run the
"hello world" program without problems, regardless of the subnet the
machines are in.


I installed 1.3rc2 and that seems to have done the trick.

On a side note, I must say that it's great to see a compile run with 
very few, if any, warnings.