On Feb 16, 2011, at 6:17 PM, Tena Sakai wrote:

> For now, may I point out something I noticed out of the
> DEBUG3 Output last night?
> 
> I found this line:
> 
>>  debug1: Sending command:  orted --daemonize -mca ess env -mca
>> orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>> --hnp-uri "125566976.0;tcp://10.96.118.236:56064"

What this means is that ssh sent the "orted ..." command to the remote side.  

As Gus mentioned, "orted" is the "Open MPI Run-Time Environment daemon" -- it's 
a helper thingy that mpirun launches on the remote nodes before launching your 
actual application.  All those parameters (from --daemonize through the 
"...:56064" URI) are options for orted.  

All of that gorp is considered internal to Open MPI -- most people never see 
that stuff.

> Followed by:
> 
>>  debug2: channel 0: request exec confirm 1
>>  debug2: fd 3 setting TCP_NODELAY
>>  debug2: callback done
>>  debug2: channel 0: open confirm rwindow 0 rmax 32768
>>  debug3: Wrote 272 bytes for a total of 1893
>>  debug2: channel 0: rcvd adjust 2097152
>>  debug2: channel_input_status_confirm: type 99 id 0

This is just more status information about the ssh connection; it doesn't 
really have any direct relation to Open MPI.

I don't know offhand whether ssh displays an ack that a command ran successfully.  
If you're not convinced that it did, then log in to the other node while the 
command is hung and run a ps to see whether the orted is actually running or not.  
I *suspect* that it is running, but that it's just hung for some reason.
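
For example, something like this on node B while the mpirun is hung (the ps 
flags vary a bit by system, but this form is common):

     B$ ps auxww | grep orted
   (if an orted process shows up, ssh did launch it and it really is the orted 
that's stuck)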

-----

Here are some suggestions to try for debugging:

On your new Linux AMI instances (some of this may be redundant with what you've 
already done):

- ensure that firewalling is disabled on all instances
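
For example, on a typical Linux instance something like this shows whether any 
iptables rules are active (assuming iptables is what your AMI uses; if these 
are EC2 instances, also check the security group settings):

     A$ sudo iptables -L -n
   (an empty rule set with ACCEPT policies means no local firewall is in the 
way)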

- ensure that your .bashrc (or whatever startup file is relevant to your shell) 
prefixes PATH and LD_LIBRARY_PATH with your Open MPI installation.  Be sure to 
*PREFIX* these variables to guarantee that you don't get interference from 
already-installed versions of Open MPI (e.g., if Open MPI is installed by 
default on your AMI and you weren't aware of it)
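
For example, something like this near the top of your .bashrc on both nodes 
(the /path/to/openmpi below is just a placeholder for wherever you installed 
1.4.3):

     export PATH=/path/to/openmpi/bin:$PATH
     export LD_LIBRARY_PATH=/path/to/openmpi/lib:$LD_LIBRARY_PATH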

- set up a simple, per-user SSH key, perhaps something like this:

     A$ rm -rf $HOME/.ssh
   (remove what you had before; let's just start over)

     A$ ssh-keygen -t dsa
   (hit enter to accept all defaults and set no passphrase)

     A$ cd $HOME/.ssh
     A$ cp id_dsa.pub authorized_keys
     A$ chmod 644 authorized_keys
     A$ ssh othernode
   (login to node B)

     B$ ssh-keygen -t dsa
   (hit enter to accept all defaults and set no passphrase; just to create 
$HOME/.ssh with the right permissions, etc.)

     B$ scp firstnode:.ssh/id_dsa\* .
   (enter your password on A -- we're overwriting all the files here)

     B$ cp id_dsa.pub authorized_keys
     B$ chmod 644 authorized_keys

Now you should be able to ssh from one node to the other without passwords:

     A$ ssh othernode hostname
     B
     A$

and

     B$ ssh firstnode hostname
     A
     B$

Don't just test with "ssh othernode" -- test with "ssh othernode <command>" to 
ensure that non-interactive logins work properly.  That's what Open MPI will 
use under the covers.

- Now ensure that PATH and LD_LIBRARY_PATH are set for non-interactive ssh 
sessions (i.e., some .bashrc files exit "early" if they detect a 
non-interactive session).  For example:

     A$ ssh othernode env | grep -i path

Ensure that the output shows the PATH and LD_LIBRARY_PATH entries for your Open 
MPI install at the beginning of those variables.  To go for the gold, you can 
try this, too:

     A$ ssh othernode which ompi_info
     (if all paths are set right, this should show the ompi_info of your 1.4.3 
install)
     A$ ssh othernode ompi_info
     (should show all the info about your 1.4.3 install)
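
If the paths show up for interactive logins but not for "ssh othernode env", 
look for an early-return guard in your .bashrc and make sure the Open MPI 
lines come before it.  A minimal sketch (the exact guard varies by distro):

     # ~/.bashrc
     export PATH=/path/to/openmpi/bin:$PATH               # Open MPI lines first...
     export LD_LIBRARY_PATH=/path/to/openmpi/lib:$LD_LIBRARY_PATH
     [ -z "$PS1" ] && return   # ...before any "not interactive? stop here" guard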

- If all the above works, then test with a simple, non-MPI application across 
both nodes:

     A$ mpirun --host firstnode,othernode -np 2 hostname
     A
     B
     A$

- When that works, you should be able to test with a simple MPI application 
(e.g., the examples/ring_c.c file in the Open MPI distribution):

     A$ cd /path/to/open/mpi/source
     A$ cd examples
     A$ make
     ...
     A$ scp ring_c othernode:/path/to/open/mpi/source/examples
     ...
     A$ mpirun --host firstnode,othernode -np 4 ring_c

Make sense?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

