Hi, all
I ran into a problem with mpirun and SSH when using Open MPI 1.7rc8. I have a 4-node cluster. This is the hostfile:

[mpiuser@testnode11 openmpi-1.6.4]$ cat ~/work/hostfile
testnode11
testnode12
testnode13
testnode14

I configured SSH by copying ".ssh/id_rsa.pub" on testnode11 to ".ssh/authorized_keys" on all 4 nodes, so that I can log in to all 4 nodes from testnode11 without a password.

The following test worked well with Open MPI 1.6.4:

[mpiuser@testnode11 openmpi-1.6.4]$ mpirun -hostfile ~/work/hostfile -np 8 ~/src/openmpi-1.6.4/examples/ring_c
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 4 exiting
Process 2 exiting
Process 3 exiting
Process 1 exiting
Process 6 exiting
Process 7 exiting
Process 5 exiting

However, when I switched to Open MPI 1.7rc8, the same test did not work:

[mpiuser@testnode11 openmpi-1.7rc8]$ mpirun -hostfile ~/work/hostfile -np 8 ~/src/openmpi-1.7rc8/examples/ring_c
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
[testnode12:06990] [[50636,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 362
[testnode12:06990] [[50636,0],1] attempted to send to [[50636,0],3]: tag 15
[testnode12:06990] [[50636,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file base/grpcomm_base_xcast.c at line 166

I checked the SSH logs and found the immediate cause: an SSH request from testnode12 to testnode14 was denied.

[mpiuser@testnode11 openmpi-1.7rc8]$ ssh root@testnode14 tail -f /var/log/secure
...
Mar 14 15:39:01 testnode14 sshd[31610]: Connection closed by testnode12
Mar 14 15:39:04 testnode14 sshd[31611]: Failed password for mpiuser from testnode12 port 55964 ssh2
Mar 14 15:39:04 testnode14 sshd[31611]: Failed password for mpiuser from testnode12 port 55964 ssh2
Mar 14 15:39:04 testnode14 sshd[31612]: Connection closed by testnode12
...

So I am puzzled. I launched mpirun on testnode11, so I do not understand why testnode12 would send an SSH request to testnode14.

One workaround is to copy ".ssh/id_rsa.pub" from every node to ".ssh/authorized_keys" on every node, but that is not what I want. Is there any way to make sure that all SSH requests are sent from the node where mpirun is executed to all the other nodes? I checked all the orte parameters and found no answer.

Any suggestions would be appreciated.

Thanks!
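
P.S. In case it helps to reproduce the denied hop without mpirun, here is a minimal sketch of how the node-to-node logins could be checked by hand. The function name check_hops is only an illustrative helper of my own, not part of Open MPI. It assumes the same user name on every node. Note that BatchMode also makes ssh fail on an unknown host key, so "DENIED" may mean a missing known_hosts entry rather than a missing public key.

```shell
#!/bin/sh
# check_hops SRC TARGET...
# (hypothetical helper) Log in to SRC from the local node, and from SRC try
# a non-interactive login to each TARGET. BatchMode=yes makes ssh fail
# instead of prompting for a password, so a hop that would need a password
# is reported as DENIED rather than hanging on a prompt.
check_hops() {
    src=$1
    shift
    for dst in "$@"; do
        if ssh -n -o BatchMode=yes "$src" ssh -n -o BatchMode=yes "$dst" true 2>/dev/null; then
            echo "$src -> $dst: ok"
        else
            echo "$src -> $dst: DENIED"
        fi
    done
}

# Example with the node names from my hostfile (run on testnode11):
#   check_hops testnode12 testnode13 testnode14
```

A hop reported as DENIED here should correspond to a "Failed password" entry like the ones in /var/log/secure above, although I have not verified that the two always match.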