I deleted all of the entries from the known_hosts file, but that didn't seem to help. I can run jobs across multiple nodes just fine without Torque, and I can ssh to all nodes without passwords, so I am not sure what the problem is.
... Okay, I found the problem. The keys I had in known_hosts were only for the short hostnames, e.g. prodnode2, whereas Torque was using fully qualified names, e.g. prodnode2.brooks.af.mil, and no keys existed for the fully qualified names.

Thanks for the help.

Sam Adams
General Dynamics Information Technology
Phone: 210.536.5945

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of George Bosilca
Sent: Friday, July 27, 2007 10:13 AM
To: Open MPI Users
Subject: Re: [OMPI users] torque and openmpi

The key is in the first line of the provided output. One of the connections failed because of a wrong ssh key. Clean your .ssh/known_hosts and the problem will vanish.

Thanks,
george.

On Jul 27, 2007, at 11:01 AM, Adams, Samuel D Contr AFRL/HEDR wrote:

> When I run jobs with torque, I get this error message. Any ideas?
>
> [sam@prodnode1 all]$ cat script.sh.err
> Host key verification failed.
> [prodnode3.brooks.af.mil:03321] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
> [prodnode3.brooks.af.mil:03321] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
> [prodnode3.brooks.af.mil:03321] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
> [prodnode3.brooks.af.mil:03321] ERROR: A daemon on node prodnode2.brooks.af.mil failed to start as expected.
> [prodnode3.brooks.af.mil:03321] ERROR: There may be more information available from
> [prodnode3.brooks.af.mil:03321] ERROR: the remote shell (see above).
> [prodnode3.brooks.af.mil:03321] ERROR: The daemon exited unexpectedly with status 255.
> [prodnode3.brooks.af.mil:03321] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
> [prodnode3.brooks.af.mil:03321] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons for this job.
> Returned value Timeout instead of ORTE_SUCCESS.
>
> --------------------------------------------------------------------------
>
> Sam Adams
> General Dynamics Information Technology
> Phone: 210.536.5945
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
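
For anyone hitting the same "Host key verification failed" message, the mismatch described in this thread (short hostnames in known_hosts, fully qualified names used by Torque) can be checked with a small script. The following is only a rough sketch, not something from the thread: the function name missing_host_keys is hypothetical, the key material in the sample is a placeholder, and it assumes plain-text (unhashed) known_hosts entries without wildcards or [host]:port forms.

```python
def missing_host_keys(known_hosts_text, hostnames):
    """Return the hostnames that have no plain-text entry in known_hosts.

    Assumption: entries are unhashed and list hosts literally, separated
    by commas. Hashed entries, marker lines, and wildcard patterns are
    not matched by this sketch.
    """
    known = set()
    for line in known_hosts_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        fields = line.split()
        if fields[0].startswith("@"):
            continue  # marker lines (@cert-authority, @revoked) are ignored here
        if fields[0].startswith("|1|"):
            continue  # hashed entry, cannot be compared as plain text
        known.update(fields[0].split(","))
    return [h for h in hostnames if h not in known]


# Placeholder key material ("AAAAB3..."), hostnames taken from the thread.
sample = (
    "prodnode2 ssh-rsa AAAAB3...\n"
    "prodnode3,prodnode3.brooks.af.mil ssh-rsa AAAAB3...\n"
)
wanted = ["prodnode2", "prodnode2.brooks.af.mil", "prodnode3.brooks.af.mil"]
print(missing_host_keys(sample, wanted))
```

Any name the script reports is one that ssh would still need a host key for. Feeding it the fully qualified node names that Torque hands to mpirun should show whether known_hosts covers them.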