Are you not using the built-in OMPI support for Torque? The ssh keys
should be irrelevant if using the TM API in Torque (i.e., OMPI won't
be using ssh to launch remote processes; we use the internal TM API
in Torque).
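One way to check whether a given installation was built with TM support is to look for "tm" components in the output of ompi_info. The sketch below is only illustrative: it assumes ompi_info prints lines of the form "MCA <framework>: <component> (...)", and the helper name tm_components and the sample text are made up for this example. Note also that the error output quoted further down mentions pls_rsh_module.c, which suggests the rsh/ssh launcher was used for that job rather than TM.

```python
import re


def tm_components(ompi_info_output):
    """Return the MCA framework names that list a 'tm' component.

    Assumes lines shaped like 'MCA pls: tm (MCA v1.0, ...)'; this is a
    hypothetical helper, not part of Open MPI.
    """
    pattern = re.compile(r"^\s*MCA\s+(\w+):\s+tm\b", re.MULTILINE)
    return pattern.findall(ompi_info_output)


# Illustrative sample only; real text would come from running `ompi_info`.
sample = """\
                 MCA ras: tm (MCA v1.0, API v1.3, Component v1.2.3)
                 MCA pls: rsh (MCA v1.0, API v1.3, Component v1.2.3)
                 MCA pls: tm (MCA v1.0, API v1.3, Component v1.2.3)
"""

found = tm_components(sample)
print(found if found else "no tm components found")
```

If the list comes back empty for real ompi_info output, the build probably lacks TM support, and mpirun would fall back to rsh/ssh, where the ssh host keys do matter.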
On Jul 27, 2007, at 11:38 AM, Adams, Samuel D Contr AFRL/HEDR wrote:
I deleted all of the entries from the known_hosts file, but that didn't
seem to help. I can run jobs just fine without Torque on multiple
nodes. I can also ssh to all nodes without using passwords, so I am not
sure what the deal is.
...
Okay, I found the problem. The keys that I had in known_hosts were only
for the short hostname, i.e., prodnode2, whereas the hostnames that
Torque was using were fully qualified names, i.e.,
prodnode2.brooks.af.mil, and no keys existed for the fully qualified
names.
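For anyone hitting the same thing, here is a rough sketch of how one might check a known_hosts file for both forms of a node name. It is only an illustration: the function name missing_host_keys and the placeholder key text are made up, and it only looks at plain-text host names, so hashed entries (those starting with "|1|") and wildcard patterns are not matched.

```python
def missing_host_keys(known_hosts_text, hostnames):
    """Return the hostnames with no plain-text entry in known_hosts.

    Hypothetical helper for illustration. Hashed entries and wildcard
    patterns are not handled.
    """
    seen = set()
    for line in known_hosts_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        # An entry may start with a marker such as @cert-authority.
        if fields[0].startswith("@"):
            fields = fields[1:]
        if not fields:
            continue
        # The host field is a comma-separated list of names/addresses.
        for name in fields[0].split(","):
            seen.add(name.lower())
    return [h for h in hostnames if h.lower() not in seen]


# Placeholder entry: only the short name is present, as in my case.
sample = "prodnode2 ssh-rsa PLACEHOLDER_KEY\n"
print(missing_host_keys(sample, ["prodnode2", "prodnode2.brooks.af.mil"]))
# prints ['prodnode2.brooks.af.mil']
```

Any name it reports would still need its key added, for example by connecting once with ssh using that exact name and accepting the host key.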
Thanks for the help.
Sam Adams
General Dynamics Information Technology
Phone: 210.536.5945
-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-bounces@open-mpi.org]
On Behalf Of George Bosilca
Sent: Friday, July 27, 2007 10:13 AM
To: Open MPI Users
Subject: Re: [OMPI users] torque and openmpi
The key is in the first line of the provided output. One of the
connections failed because of a wrong ssh key. Clean your
.ssh/known_hosts and the problem will vanish.
Thanks,
george.
On Jul 27, 2007, at 11:01 AM, Adams, Samuel D Contr AFRL/HEDR wrote:
When I run jobs with torque, I get this error message. Any ideas?
[sam@prodnode1 all]$ cat script.sh.err
Host key verification failed.
[prodnode3.brooks.af.mil:03321] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[prodnode3.brooks.af.mil:03321] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[prodnode3.brooks.af.mil:03321] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[prodnode3.brooks.af.mil:03321] ERROR: A daemon on node prodnode2.brooks.af.mil failed to start as expected.
[prodnode3.brooks.af.mil:03321] ERROR: There may be more information available from
[prodnode3.brooks.af.mil:03321] ERROR: the remote shell (see above).
[prodnode3.brooks.af.mil:03321] ERROR: The daemon exited unexpectedly with status 255.
[prodnode3.brooks.af.mil:03321] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[prodnode3.brooks.af.mil:03321] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------
Sam Adams
General Dynamics Information Technology
Phone: 210.536.5945
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Jeff Squyres
Cisco Systems