Prentice Bisbal wrote:


> I'm assuming you already tested ssh connectivity and verified everything
> is working as it should. (You did test all that, right?)

Yes. I am able to log in remotely to all nodes from the master, and from each node to every other node, without a password. Each node mounts the same /home directory from the master, so they have the same copy of all the ssh and rsh keys.
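
A check like this can be scripted. The following is only a minimal sketch, not a tested recipe for this cluster: the function name pairwise_ssh_checks is hypothetical, the host list is an assumption built from names mentioned in this thread, and the script only prints the ssh commands rather than running them. BatchMode=yes makes ssh fail instead of prompting for a password, and -n keeps the outer ssh from reading stdin.

```shell
#!/bin/sh
# Print one ssh command per ordered pair (src, dst) of the given hosts.
# Each command logs in to src and, from there, to dst, and runs `true`.
pairwise_ssh_checks() {
    for src in "$@"; do
        for dst in "$@"; do
            # Skip a host connecting to itself.
            [ "$src" = "$dst" ] && continue
            echo "ssh -n -o BatchMode=yes -o ConnectTimeout=5 $src ssh -o BatchMode=yes -o ConnectTimeout=5 $dst true"
        done
    done
}

# Assumed host list; replace with the contents of the real machine file.
pairwise_ssh_checks pleiades merope electra atlas
```

Each printed line can be run by hand; a non-zero exit status would point at the pair of hosts that cannot reach each other without a password.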

> This sounds like a configuration problem on one of the nodes, or a problem
> with ssh. I suspect it's not a problem with the number of processes, but
> that whichever node is the 4th in your machinefile has a connectivity or
> configuration issue.
>
> I would try the following:
>
> 1. Reorder the list of hosts in your machine file.
> 3. Change your machinefile to include 4 completely different hosts.

This does not seem to have any beneficial effect.

The test program, run from the master (pleiades) with any combination of 3 other nodes, hangs during communication. The same happens when I use -host instead of --machinefile, e.g.:

$ mpirun -host merope,electra,atlas -np 4 ./test.out (hangs)
$ mpirun -host merope,electra,atlas -np 3 ./test.out (hangs)
$ mpirun -host merope,electra -np 3 ./test.out
 node           1 : Hello world
 node           0 : Hello world
 node           2 : Hello world
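
To narrow down which host combinations hang, runs like the ones above could be generated systematically. This is a rough sketch under stated assumptions: the function name pairwise_mpirun_cmds is hypothetical, the 30-second limit is arbitrary, and the commands are only printed, not executed. Each one is wrapped in the coreutils timeout command, which exits with status 124 when the limit is reached, so a hang shows up as a failure instead of blocking the terminal.

```shell
#!/bin/sh
# Print one mpirun command per unordered pair of hosts.
# Usage: pairwise_mpirun_cmds NP HOST...
pairwise_mpirun_cmds() {
    np=$1
    shift
    # Pair the first remaining host with every host after it, then drop it.
    while [ $# -gt 1 ]; do
        first=$1
        shift
        for second in "$@"; do
            echo "timeout 30 mpirun -host $first,$second -np $np ./test.out"
        done
    done
}

# Hosts taken from the runs shown above.
pairwise_mpirun_cmds 3 merope electra atlas
```

Running the printed commands one at a time and noting which ones time out would show whether the hang follows a particular host or any group above a certain size.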

> 2. Run the mpirun command from a different host. I'd try running it from
> several different hosts.

The mpirun command does not seem to work when launched from one of the nodes. 
As an example:

Running on node asterope:

asterope$ mpirun -debug-daemons -host atlas,electra -np 4 ./test.out

Daemon was launched on atlas - beginning to initialize
Daemon was launched on electra - beginning to initialize
Daemon [[54956,0],1] checking in as pid 2716 on host atlas
Daemon [[54956,0],1] not using static ports
Daemon [[54956,0],2] checking in as pid 2741 on host electra
Daemon [[54956,0],2] not using static ports

(hangs)

> I think someone else recommended that you should be specifying the
> number of processes with -np. I second that.
>
> If the above fails, you might want to post the machine file you're using.

The machine file is a simple list of hostnames, one per line, for example:

m43
taygeta
asterope



Cheers,
Ethan

--
Dr. Ethan Deneault
Assistant Professor of Physics
SC-234
University of Tampa
Tampa, FL 33615
Office: (813) 257-3555
