On Feb 10, 2006, at 12:18 PM, James Conway wrote:

Open MPI uses random port numbers for all its communication.
(etc)

Thanks for the explanation. I will live with the open firewall, and
look at the ipfw docs for writing a script.

That may be somewhat difficult. We previously looked into making LAM/MPI work behind firewalls and ran into some unexpected issues -- the short version was that, at least for the way LAM was set up, even if you could restrict the port numbers that LAM would choose for its TCP communications, you still had to have a virtual host out in front of the firewall to relay the traffic to the appropriate internal host. Specifically, you had to have an IP address out in front of the firewall for each host so that traffic would route to the appropriate back-end instance of the MPI application on the appropriate host.

The real solution here is to have Open MPI be able to route its TCP communications through intermediate hosts instead of assuming that it is always talking directly to the target host. (LAM actually had the run-time-layer version of that implemented eons ago, but we've never used it -- and more changes would be needed up at the TCP layer to do the same thing.)

We have not yet added any TCP routing capabilities in Open MPI. It's on the long-range to-do list (meaning: several of us have talked about it and agree that it's a good idea, but no one has committed to any timeframe as to when it would be done). Contributions from the community would be greatly appreciated. :-)

Now I have a more "core" Open MPI problem, which may be just
unfamiliarity on my part. I seem to have the environment variables
set up all right, though -- the code runs, but doesn't complete.

I have copied the "MPI Tutorial: The canonical ring program" from
<http://www.lam-mpi.org/tutorials/>. It compiles and runs fine on the
localhost (one CPU, one or more MPI processes). If I copy it to a
remote host, it does one round of passing the 'tag' and then stalls. I
modified the print statements a bit to see where in the code it
stalls, but the loop hasn't changed. This is what I see happening:
1. Process 0 successfully kicks off the pass-around by sending the
tag to the next process (1), and then enters the loop where it waits
for the tag to come back.
2. Process 1 enters the loop, receives the tag and passes it on (back
to process 0 since this is a ring of 2 players only).
3. Process 0 successfully receives the tag, decrements it, and calls
the next send (MPI_Send), but it never returns from this call. I have
a print statement right after (with fflush), but there is no output.
The CPU is maxed out on both the local and remote hosts, so I assume
some kind of polling is going on.
4. Needless to say, Process 1 never reports receipt of the tag.
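
For reference, the ring logic looks roughly like this (a simplified sketch based on the tutorial, not the exact code -- the extra prints and the tag value 201 here are just illustrative):

    /* Simplified sketch of the canonical ring program: rank 0 injects the
     * tag, every rank loops receiving from the previous rank and sending to
     * the next, and rank 0 decrements the tag each time it comes around.
     * When the tag reaches 0, everyone drops out of the loop.  Assumes a
     * positive count is entered. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, next, prev, num, tag = 201;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Process rank %d: size = %d\n", rank, size);

        next = (rank + 1) % size;
        prev = (rank + size - 1) % size;

        if (rank == 0) {
            printf("Enter the number of times around the ring: ");
            fflush(stdout);
            scanf("%d", &num);
            printf("Process 0 doing first send of '%d' to %d\n", num, next);
            MPI_Send(&num, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
            printf("Process 0 finished sending, now entering loop\n");
        }

        while (1) {
            printf("Process %d waiting to receive from %d\n", rank, prev);
            fflush(stdout);
            MPI_Recv(&num, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("Process %d received '%d' from %d\n", rank, num, prev);
            if (rank == 0) {
                --num;
                printf("Process 0 decremented num\n");
            }
            printf("Process %d sending '%d' to %d\n", rank, num, next);
            fflush(stdout);
            MPI_Send(&num, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
            printf("Process %d finished sending\n", rank);
            if (num == 0)
                break;
        }

        /* The last message around the ring arrives back at rank 0 after it
         * has left the loop, so absorb it here before finalizing. */
        if (rank == 0)
            MPI_Recv(&num, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }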

Output (with a little re-ordering to make sense) is:
    mpirun --hostfile my_mpi_hosts --np 2 mpi_test1
    Process rank 0: size = 2
    Process rank 1: size = 2
    Enter the number of times around the ring: 5

    Process 0 doing first send of '4' to 1
    Process 0 finished sending, now entering loop

    Process 0 waiting to receive from 1

    Process 1 waiting to receive from 0
    Process 1 received '4' from 0
    Process 1 sending '4' to 0
    Process 1 finished sending
    Process 1 waiting to receive from 0

    Process 0 received '4' from 1
    Process 0 decremented num
    Process 0 sending '3' to 1
    !---- nothing more - hangs at 100% cpu until ctrl-
    !---- should see "Process 0 finished sending"

Since process 0 succeeds in calling MPI_Send before the loop, and in
calling MPI_Recv at the start of the loop, the communications appear
to be working. Likewise, process 1 succeeds in receiving and sending
within the loop. However, if it's significant, these calls each work
only once per process -- it is the second time process 0 calls
MPI_Send that the hang occurs.

Well that is definitely odd. The fact that the first send finishes and the second does not is quite fishy. A few questions:

- Have you absolutely, entirely disabled all firewalling between the two hosts?
- Do you have only one TCP interface on both machines? If you have more than one, we can try telling Open MPI to ignore one of them.
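
If it comes to that, the relevant knobs are the TCP BTL's interface selection MCA parameters (btl_tcp_if_include / btl_tcp_if_exclude). For example, something along these lines would restrict Open MPI to a single interface (en0 is just a placeholder -- substitute the name of the interface you actually want to use):

    mpirun --mca btl_tcp_if_include en0 --hostfile my_mpi_hosts --np 2 mpi_test1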

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/

