Todd,

Similar issues have also been reported when there is Network Address Translation (NAT) between hosts; that occurred with kvm/qemu virtual machines running on the same host.


First, list the available interfaces on both nodes. Then try to restrict Open MPI to a single interface that is known to work (no firewall and no NAT), e.g.

mpirun --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 ...
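For the interface listing step, here is a minimal sketch, assuming both nodes run Linux with the iproute2 "ip" tool installed. The helper name list_ipv4 is made up for illustration; it only trims the output down to the interface name and its IPv4 address.

```shell
#!/bin/sh
# Sketch: print "interface address/prefix" for every IPv4 address on this node.
# Assumption: Linux with iproute2; list_ipv4 is a hypothetical helper name.

# In "ip -o -4 addr show" output, field 2 is the interface name
# and field 4 is the address with its prefix length.
list_ipv4() {
    awk '{ print $2, $4 }'
}

# Only call ip when it is installed, so the script does nothing elsewhere.
if command -v ip >/dev/null 2>&1; then
    ip -o -4 addr show | list_ipv4
fi
```

Run it on both nodes and pick an interface name whose addresses sit on the same subnet on each side; that is the name to pass to the two if_include parameters.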


If that does not help, make sure there is no NAT:

on the first node, run

nc -v -l 1234

then on the other node, run

nc <ip of the first node> 1234


If you go back to the first node, you should see the expected IP of the second node.

If not, there is NAT somewhere, and that does not fly well with Open MPI.
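To make the comparison explicit, here is a small sketch. check_peer_ip is a hypothetical helper, not part of Open MPI or nc; you copy the two addresses in by hand (the address you expect the second node to use, and the address nc printed on the first node). The sample call uses the two addresses from the error message quoted below.

```shell
#!/bin/sh
# Sketch: compare the address you expected with the one nc reported.
# Assumption: both values are typed in by hand; nothing is parsed from nc.

check_peer_ip() {
    expected="$1"   # address the peer is supposed to connect from
    observed="$2"   # address shown by "nc -v -l" on the listening node
    if [ "$expected" = "$observed" ]; then
        echo "OK: peer connected from $observed"
    else
        echo "MISMATCH: expected $expected but saw $observed"
    fi
}

# Addresses taken from the error message in the original report.
check_peer_ip 192.168.0.225 192.168.0.3
```

A mismatch means the connection left the second node from a different address than the one Open MPI knows about, which is what you would see with NAT in the path (a second interface on the same network is another possibility worth ruling out).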


Cheers,


Gilles


On 3/28/2023 8:53 AM, Todd Spencer via users wrote:

OpenMPI Users,

I hope this email finds you all well. I am writing to bring to your attention an issue that I have encountered while using OpenMPI.

I received the following error message while running a job:

"Open MPI detected an inbound MPI TCP connection request from a peer that appears to be part of this MPI job (i.e., it identified itself as part of this Open MPI job), but it is from an IP address that is unexpected. This is highly unusual. The inbound connection has been dropped, and the peer should simply try again with a different IP interface (i.e., the job should hopefully be able to continue).

Local host: node02
Local PID: 17805
Peer hostname: node01 ([[23078,1],2])
Source IP of socket: 192.168.0.3
Known IPs of peer: 192.168.0.225"

I have tried to troubleshoot the issue, but to no avail. As I am new to this subject, I am not sure what could be causing it. I did try forcing the nodes to talk to each other over eth0 with the "-mca btl_tcp_if_include eth0" option, but it did not work.

I found a GitHub thread <https://github.com/open-mpi/ompi/issues/5818> from 2018 that discussed the issue, but since I am new to this, a lot of the subject matter went over my head. Could you please advise on what could be causing this issue and how to resolve it? If you need any additional information, I would be happy to provide it.

Thank you in advance for your help.

Best regards,

Todd

