Jose --

This sounds like a problem that we just recently fixed in the 1.0.x branch -- there were some situations where the "wrong" ethernet device could have been picked by Open MPI (e.g., if you have a cluster with all private IP addresses, and you run an MPI job that spans the head node and the compute nodes, but the head node has multiple IP addresses). Can you try the latest 1.0.2 release candidate tarball and let us know if this fixes the problem?

        http://www.open-mpi.org/software/ompi/v1.0/

Specifically, you should no longer need to specify that btl_tcp_if_include parameter -- Open MPI should be able to "figure it all out" for you.
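
(For reference, the workaround that has been needed so far is to force the TCP component onto one specific interface, either in the MCA parameter file or on the mpirun command line; the interface name and application name below are just placeholders:

    # system-wide MCA parameter file, e.g. /usr/local/etc/openmpi-mca-params.conf
    btl_tcp_if_include = eth1

    # or equivalently, just for a single run
    mpirun --mca btl_tcp_if_include eth1 -np 2 -hostfile foo ./my_app

With the 1.0.2 release candidate, neither of these should be necessary anymore.)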

Let us know if this works for you.



On Mar 2, 2006, at 1:28 PM, Jose Pedro Garcia Mahedero wrote:

Finally, it was a network problem. I had to make Open MPI ignore one of the network interfaces on the master node of the cluster by setting btl_tcp_if_include = eth1 in the file /usr/local/etc/openmpi-mca-params.conf
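
For the archives, the fix amounts to this one line in the system-wide parameter file; eth1 here is of course specific to my cluster:

    # /usr/local/etc/openmpi-mca-params.conf
    btl_tcp_if_include = eth1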

Thank you all for your help.

Jose Pedro
On 3/1/06, Jose Pedro Garcia Mahedero <jpgmahed...@gmail.com> wrote:
OK, it ALMOST works!!

Now I've installed MPI on a non-clustered machine and it works, but surprisingly it only works in one direction: it works fine from machine OUT1 as master to machine CLUSTER1 as slave, but (here was my surprise) it doesn't work the other way around! If I run the same program with CLUSTER1 as master, it only sends one message from master to slave and blocks while sending the second message. Maybe it is a firewall/iptables problem.

Does anybody know which ports MPI uses to send requests/responses, or how to trace it? What I really don't understand is why it happens at the second message and not the first one :-( I know my slave never finishes, but it is not intended to right now; it will in a future version. I don't think that is the main problem, though :-S
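
A side note on the ports question, in case it helps anyone debugging the same thing: as far as I understand, Open MPI's TCP connections use ports assigned dynamically by the operating system rather than one fixed port, so there is nothing obvious to open in a firewall. One rough way to see what is actually in use is to look at the live connections on each node while the job is blocked (the address below is just a placeholder):

    netstat -tn | grep <other-node-ip>

or to watch the suspect interface with tcpdump.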

I'm sending an attachment with the (very simple) code and a tarball with my config.log.

Thanks!


On 3/1/06, Jose Pedro Garcia Mahedero <jpgmahed...@gmail.com> wrote:
You're right, I'll try NetPIPE first and then the application. If it doesn't work, I'll send my configs and more detailed information.

Thank you!


On 3/1/06, Brian Barrett <brbar...@open-mpi.org> wrote:
Jose -

I noticed that your output doesn't appear to match what the source
code is capable of generating.  It's possible that you're running
into problems with the code that we can't see because you didn't send
a complete version of the source code.

You might want to start by running some 3rd party codes that are
known to be good, just to make sure that your MPI installation checks
out.  A good start is NetPIPE, which runs between two peers and gives
latency / bandwidth information.  If that runs, then it's time to
look at your application.  If that doesn't run, then it's time to
look at the MPI installation in more detail.  In this case, it would
be useful to see all of the information requested here:

   http://www.open-mpi.org/community/help/

as well as the output from running the mpirun command used to start
NetPIPE with the -d option, so something like:

   mpirun -np 2 -hostfile foo -d ./NPMpi

Brian

On Feb 28, 2006, at 9:29 AM, Jose Pedro Garcia Mahedero wrote:

> Hello everybody.
>
> I'm new to MPI and I'm having some problems while running a simple
> ping-pong program on more than one node.
>
> 1.- I followed all the instructions and installed Open MPI without
> problems on a Beowulf cluster.
> 2.- The cluster is working OK and SSH keys are set up so there is no
> password prompting.
> 3.- mpiexec seems to run OK.
> 4.- Now I'm using just 2 nodes: I've tried a simple ping-pong
> application, but my master only sends one request!!
> 5.- I reduced the problem by trying to send just two messages to the
> same node:
>
> #include <mpi.h>
> #include <unistd.h>
> #include <iostream>
>
> using std::cout;
> using std::endl;
>
> /* Message tags -- values assumed here; the original attachment defines them */
> #define WORKTAG 1
> #define DIETAG  2
>
> int main(int argc, char **argv){
>   int myrank;
>
>   /* Initialize MPI */
>   MPI_Init(&argc, &argv);
>
>   /* Find out my identity in the default communicator */
>   MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
>
>   if (myrank == 0) {
>     /* Master: send 10 work messages to rank 1 */
>     int work = 100;
>     int count = 0;
>     for (int i = 0; i < 10; i++){
>       cout << "MASTER IS SLEEPING..." << endl;
>       sleep(3);
>       cout << "MASTER AWAKE WILL SEND[" << count++ << "]:" << work << endl;
>       MPI_Send(&work, 1, MPI_INT, 1, WORKTAG, MPI_COMM_WORLD);
>     }
>   } else {
>     /* Slave: receive until a DIETAG message arrives (the master above
>        never sends DIETAG, so this loop does not terminate yet) */
>     int count = 0;
>     int work;
>     MPI_Status status;
>     while (true){
>       MPI_Recv(&work, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
>       cout << "SLAVE[" << myrank << "] RECEIVED[" << count++ << "]:" << work << endl;
>       if (status.MPI_TAG == DIETAG) {
>         break;
>       }
>     } // while
>   }
>
>   MPI_Finalize();
>   return 0;
> }
>
>
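> For reference, building and launching goes roughly like this (assuming
> the standard Open MPI wrapper compilers; the file names are just
> examples):
>
>    mpic++ -o pingpong pingpong.cc
>    mpirun -np 2 -hostfile mpihostsfile ./pingpong
>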
> 6a.- RESULTS (if I put more than one machine in my mpihostsfile): my
> master sends the first message and my slave receives it perfectly,
> but my master doesn't send its second message:
>
>
>
> Here's my output
>
> MASTER IS SLEEPING...
> MASTER AWAKE WILL SEND[0]:100
> MASTER IS SLEEPING...
> SLAVE[1] RECEIVED[0]:100MPI_STATUS.MPI_ERROR:0
> MASTER AWAKE WILL SEND[1]:100
>
> 6b.- RESULTS (if I put ONLY 1 machine in my mpihostsfile): everything
> is OK through iteration 9!!!
> MASTER IS SLEEPING...
> MASTER AWAKE WILL SEND[0]:100
> MASTER IS SLEEPING...
> MASTER AWAKE WILL SEND[1]:100
> MASTER IS SLEEPING...
> MASTER AWAKE WILL SEND[2]:100
> MASTER IS SLEEPING...
> MASTER AWAKE WILL SEND[3]:100
> MASTER IS SLEEPING...
> MASTER AWAKE WILL SEND[4]:100
> MASTER IS SLEEPING...
> MASTER AWAKE WILL SEND[5]:100
> MASTER IS SLEEPING...
> MASTER AWAKE WILL SEND[6]:100
> MASTER IS SLEEPING...
> MASTER AWAKE WILL SEND[7]:100
> MASTER IS SLEEPING...
> MASTER AWAKE WILL SEND[8]:100
> MASTER IS SLEEPING...
> MASTER AWAKE WILL SEND[9]:100
> SLAVE[1] RECEIVED[0]:100MPI_STATUS.MPI_ERROR:0
> SLAVE[1] RECEIVED[1]:100MPI_STATUS.MPI_ERROR:0
> SLAVE[1] RECEIVED[2]:100MPI_STATUS.MPI_ERROR:0
> SLAVE[1] RECEIVED[3]:100MPI_STATUS.MPI_ERROR:0
> SLAVE[1] RECEIVED[4]:100MPI_STATUS.MPI_ERROR:0
> SLAVE[1] RECEIVED[5]:100MPI_STATUS.MPI_ERROR:0
> SLAVE[1] RECEIVED[6]:100MPI_STATUS.MPI_ERROR:0
> SLAVE[1] RECEIVED[7]:100MPI_STATUS.MPI_ERROR:0
> SLAVE[1] RECEIVED[8]:100MPI_STATUS.MPI_ERROR:0
> SLAVE[1] RECEIVED[9]:100MPI_STATUS.MPI_ERROR:0
> --------------------------------
>
> I know this is a lot of text, but I wanted to make the question as
> detailed as possible. I've been searching the FAQ, but I still don't
> know what is going on (or why)...
>
> Can anyone help, please? :-)
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
   Brian Barrett
   Open MPI developer
   http://www.open-mpi.org/




--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/

