On Feb 16, 2011, at 6:17 PM, Tena Sakai wrote:

> For now, may I point out something I noticed out of the
> DEBUG3 Output last night?
>
> I found this line:
>
>> debug1: Sending command: orted --daemonize -mca ess env -mca
>> orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>> --hnp-uri "125566976.0;tcp://10.96.118.236:56064"
What this means is that ssh sent the "orted ..." command to the remote side.  As Gus mentioned, "orted" is the "Open MPI Run-Time Environment daemon" -- it's a helper thingy that mpirun launches on the remote nodes before launching your actual application.  All those parameters (from --daemonize through ...:56064") are options for orted.  All of that gorp is considered internal to Open MPI -- most people never see that stuff.

> Followed by:
>
>> debug2: channel 0: request exec confirm 1
>> debug2: fd 3 setting TCP_NODELAY
>> debug2: callback done
>> debug2: channel 0: open confirm rwindow 0 rmax 32768
>> debug3: Wrote 272 bytes for a total of 1893
>> debug2: channel 0: rcvd adjust 2097152
>> debug2: channel_input_status_confirm: type 99 id 0

This is just more status information about the ssh connection; it doesn't really have any direct relation to Open MPI.

I don't know offhand whether ssh displays the ack that a command successfully ran.  If you're not convinced that it did, then log in to the other node while the command is hung and run a ps to see whether the orted is actually running or not.  I *suspect* that it is running, but that it's just hung for some reason.

-----

Here are some suggestions to try for debugging.  On your new Linux AMI instances (some of this may be redundant with what you did already):

- ensure that firewalling is disabled on all instances

- ensure that your .bashrc (or whatever startup file is relevant to your shell) is set to prefix PATH and LD_LIBRARY_PATH with your Open MPI installation.  Make sure that you *PREFIX* these variables to guarantee that you don't get interference from already-installed versions of Open MPI (e.g., if Open MPI is installed by default on your AMI and you weren't aware of it).  There's a sketch of what the top of such a .bashrc might look like below.

- set up a simple, per-user SSH key, perhaps something like this:

A$ rm -rf $HOME/.ssh
   (remove what you had before; let's just start over)
A$ ssh-keygen -t dsa
   (hit enter to accept all defaults and set no passphrase)
A$ cd $HOME/.ssh
A$ cp id_dsa.pub authorized_keys
A$ chmod 644 authorized_keys
A$ ssh othernode
   (login to node B)
B$ ssh-keygen -t dsa
   (hit enter to accept all defaults and set no passphrase; this is just
    to create $HOME/.ssh with the right permissions, etc.)
B$ scp @firstnode:.ssh/id_dsa\* .
   (enter your password on A -- we're overwriting all the files here)
B$ cp id_dsa.pub authorized_keys
B$ chmod 644 authorized_keys

Now you should be able to ssh from one node to the other without passwords:

A$ ssh othernode hostname
B
A$

and

B$ ssh firstnode hostname
A
B$

Don't just test with "ssh othernode" -- test with "ssh othernode <command>" to ensure that non-interactive logins work properly.  That's what Open MPI will use under the covers.

- now ensure that PATH and LD_LIBRARY_PATH are set for non-interactive ssh sessions (i.e., some .bashrc's will exit "early" if they detect that it is a non-interactive session).  For example:

A$ ssh othernode env | grep -i path

Ensure that the output shows the PATH and LD_LIBRARY_PATH locations for Open MPI at the beginning of those variables.
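If it helps, here's a minimal sketch of what the top of such a .bashrc could look like.  The /opt/openmpi-1.4.3 prefix is just an assumption for illustration -- substitute wherever you actually installed your 1.4.3 build:

# Prepend the Open MPI 1.4.3 install so that any Open MPI that shipped
# with the AMI cannot shadow it.  /opt/openmpi-1.4.3 is an assumed
# prefix -- use your actual install location.
export PATH=/opt/openmpi-1.4.3/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-1.4.3/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}

# Anything below this point may be skipped for non-interactive logins
# (e.g., "ssh othernode <command>"), so keep the exports above in front
# of any "early exit" test like this one:
[ -z "$PS1" ] && return

The exact layout doesn't matter; what matters is that the exports come before any line that bails out for non-interactive shells, so that "ssh othernode env | grep -i path" shows the same thing an interactive login does.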
To go for the gold, you can try this, too:

A$ ssh othernode which ompi_info
   (if all paths are set right, this should show the ompi_info of your
    1.4.3 install)
A$ ssh othernode ompi_info
   (should show all the info about your 1.4.3 install)

- If all the above works, then test with a simple, non-MPI application across both nodes:

A$ mpirun --host firstnode,othernode -np 2 hostname
A
B
A$

- When that works, you should be able to test with a simple MPI application (e.g., the examples/ring_c.c file in the Open MPI distribution):

A$ cd /path/to/open/mpi/source
A$ cd examples
A$ make
...
A$ scp ring_c @othernode:/path/to/open/mpi/source/examples
...
A$ mpirun --host firstnode,othernode -np 4 ring_c

Make sense?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/