Hi, all

I ran into a problem with mpirun and SSH when using Open MPI 1.7rc8.


I have a 4-node cluster. This is the hostfile:


[mpiuser@testnode11 openmpi-1.6.4]$ cat ~/work/hostfile
testnode11
testnode12
testnode13
testnode14


I configured SSH by copying ".ssh/id_rsa.pub" from testnode11 to 
".ssh/authorized_keys" on all 4 nodes, so that I can log in to all 4 nodes 
from testnode11 without a password.
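
For reference, the key setup described above can be sketched roughly as follows. This is only an illustration, not the exact commands I ran: HOSTFILE, DRY_RUN, push_key and distribute_key are names made up for this sketch, and it assumes the standard ssh-copy-id tool is available. By default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: install this node's public key on every host listed in a hostfile.
# HOSTFILE and DRY_RUN are illustrative names, not part of the original setup.
HOSTFILE="${HOSTFILE:-$HOME/work/hostfile}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

push_key() {
    # $1 = host name taken from the hostfile
    if [ "$DRY_RUN" = "1" ]; then
        echo "ssh-copy-id -i $HOME/.ssh/id_rsa.pub mpiuser@$1"
    else
        ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "mpiuser@$1"
    fi
}

distribute_key() {
    # $1 = path to a hostfile with one plain host name per line
    for host in $(cat "$1"); do
        push_key "$host"
    done
}

# Usage (run on testnode11): distribute_key "$HOSTFILE"
```

Because only testnode11's key is distributed, this gives passwordless login from testnode11 to every node, but not between the other nodes.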


The following test worked well with Open MPI 1.6.4.


[mpiuser@testnode11 openmpi-1.6.4]$ mpirun -hostfile ~/work/hostfile -np 8 ~/src/openmpi-1.6.4/examples/ring_c
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 4 exiting
Process 2 exiting
Process 3 exiting
Process 1 exiting
Process 6 exiting
Process 7 exiting
Process 5 exiting


However, when I switched to Open MPI 1.7rc8, the same test did not work.


[mpiuser@testnode11 openmpi-1.7rc8]$ mpirun -hostfile ~/work/hostfile -np 8 ~/src/openmpi-1.7rc8/examples/ring_c
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
[testnode12:06990] [[50636,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 362
[testnode12:06990] [[50636,0],1] attempted to send to [[50636,0],3]: tag 15
[testnode12:06990] [[50636,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file base/grpcomm_base_xcast.c at line 166


I checked the SSH logs and found the immediate cause: an SSH request from 
testnode12 to testnode14 was denied.


[mpiuser@testnode11 openmpi-1.7rc8]$ ssh root@testnode14 tail -f /var/log/secure
...
Mar 14 15:39:01 testnode14 sshd[31610]: Connection closed by testnode12
Mar 14 15:39:04 testnode14 sshd[31611]: Failed password for mpiuser from 
testnode12 port 55964 ssh2
Mar 14 15:39:04 testnode14 sshd[31611]: Failed password for mpiuser from 
testnode12 port 55964 ssh2
Mar 14 15:39:04 testnode14 sshd[31612]: Connection closed by testnode12
...
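
The failing pair could also be confirmed outside of mpirun. The sketch below is only an illustration under my own assumptions: pair_checks is a made-up helper name, and it prints one check command per ordered pair of hosts rather than running anything. Each printed command logs in to the first host and, from there, tries a key-only login (BatchMode=yes disables password prompts) to the second host.

```shell
#!/bin/sh
# Sketch: print an SSH check command for every ordered pair of distinct hosts.
# pair_checks is a hypothetical helper; it prints commands and runs nothing.
pair_checks() {
    # $1 = path to a hostfile with one plain host name per line
    for src in $(cat "$1"); do
        for dst in $(cat "$1"); do
            [ "$src" = "$dst" ] && continue
            echo "ssh -o BatchMode=yes $src ssh -o BatchMode=yes $dst true"
        done
    done
}

# Usage: pair_checks ~/work/hostfile
```

With my setup, I would expect only the checks whose first host is testnode11 to succeed, since only testnode11's public key has been installed on the other nodes.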


So I am puzzled. I launched mpirun on testnode11, so I do not understand why 
testnode12 would send an SSH request to testnode14.
One workaround is to copy ".ssh/id_rsa.pub" from every node to 
".ssh/authorized_keys" on every node, but that is not what I want.
Is there a way to make sure that all SSH requests are sent from the node 
where mpirun is executed to all the other nodes?
I checked all the orte parameters but found no answer. Any suggestions would 
be appreciated.


Thanks!
