Put "oob=^usock" in your default MCA param file, or add OMPI_MCA_oob=^usock to
your environment.
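
A minimal sketch of both options follows. It assumes the per-user parameter file lives at $HOME/.openmpi/mca-params.conf, which is the usual default location; if your site uses a system-wide file under the install prefix's etc/ directory instead, adjust the path accordingly.

```shell
# Option 1: exclude the usock OOB component via the per-user MCA param file.
# The path below is an assumption (the usual per-user default); adjust as needed.
conf_dir="${HOME}/.openmpi"
mkdir -p "$conf_dir"
echo 'oob=^usock' >> "$conf_dir/mca-params.conf"

# Option 2: set the same parameter through the environment for this shell
# (and anything launched from it).
export OMPI_MCA_oob='^usock'
```

For a one-off run, the same parameter can be passed on the command line, for example `mpirun --mca oob ^usock ./my_app`, where `./my_app` is a placeholder for your own binary.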

> On Aug 11, 2018, at 5:54 AM, Kapetanakis Giannis <bil...@edu.physics.uoc.gr> 
> wrote:
> 
> Hi,
> 
> I'm struggling to get 2.1.x to work with our HPC.
> 
> Versions 1.8.8 and 3.x work fine.
> 
> With 2.1.3 and 2.1.4 I get errors and segmentation faults. The builds have
> InfiniBand and Slurm support.
> Running mpirun locally works fine. Any help debugging this?
> 
> [node39:20090] [[50526,1],2] usock_peer_recv_connect_ack: received unexpected 
> process identifier [[50526,0],0] from [[50526,0],1]
> [node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20088] [[50526,1],0] usock_peer_recv_connect_ack: received unexpected 
> process identifier [[50526,0],0] from [[50526,0],1]
> [node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20096] [[50526,1],8] usock_peer_recv_connect_ack: received unexpected 
> process identifier [[50526,0],0] from [[50526,0],1]
> [node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],8] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],8] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],8] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],8] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],6] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],8] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],6] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20094] [[50526,1],6] usock_peer_recv_connect_ack: received unexpected 
> process identifier [[50526,0],0] from [[50526,0],1]
> [node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],8] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],6] mca_oob_usock_peer_recv_handler: 
> invalid socket state(1)
> [node39:20097] [[50526,1],9] usock_peer_recv_connect_ack: received unexpected 
> process identifier [[50526,0],0] from [[50526,0],1]
> [node39:20092] [[50526,1],4] usock_peer_recv_connect_ack: received unexpected 
> process identifier [[50526,0],0] from [[50526,0],1]
> 
> 
> Part of the debug output:
> 
> [node39:20515] mca:oob:select: Inserting component
> [node39:20515] mca:oob:select: Found 3 active transports
> [node39:20515] [[50428,1],9]: set_addr to uri 
> 3304849408.1;usock;tcp://192.168.20.113,10.1.7.69:37147;ud://181895.60.1
> [node39:20515] [[50428,1],9]:set_addr checking if peer [[50428,0],1] is 
> reachable via component usock
> [node39:20515] [[50428,1],9]:[oob_usock_component.c:349] connect to 
> [[50428,0],1]
> [node39:20515] [[50428,1],9]: peer [[50428,0],1] is reachable via component 
> usock
> [node39:20515] [[50428,1],9]:set_addr checking if peer [[50428,0],1] is 
> reachable via component tcp
> [node39:20515] [[50428,1],9] oob:tcp: ignoring address usock
> [node39:20515] [[50428,1],9] oob:tcp: working peer [[50428,0],1] address 
> tcp://192.168.20.113,10.1.7.69:37147
> [node39:20515] [[50428,1],9] PASSING ADDR 192.168.20.113 TO MODULE
> [node39:20515] [[50428,1],9]:tcp set addr for peer [[50428,0],1]
> [node39:20515] [[50428,1],9] PASSING ADDR 10.1.7.69 TO MODULE
> [node39:20515] [[50428,1],9]:tcp set addr for peer [[50428,0],1]
> [node39:20515] [[50428,1],9] oob:tcp: ignoring address ud://181895.60.1
> [node39:20515] [[50428,1],9]: peer [[50428,0],1] is reachable via component 
> tcp
> [node39:20515] [[50428,1],9]:set_addr checking if peer [[50428,0],1] is 
> reachable via component ud
> [node39:20515] [[50428,1],9] oob:ud:set_addr: setting location for peer 
> [[50428,0],1] from ud://181895.60.1
> [node39:20515] [[50428,1],9]: peer [[50428,0],1] is reachable via component ud
> [node39:20515] [[50428,1],9] orte_usock_peer_try_connect: attempting to 
> connect to proc [[50428,0],1]
> [node39:20515] [[50428,1],9] orte_usock_peer_try_connect: attempting to 
> connect to proc [[50428,0],1] on socket 21
> [node39:20515] [[50428,1],9] orte_usock_peer_try_connect: attempting to 
> connect to proc [[50428,0],1] - 0 retries
> [node39:20515] [[50428,1],9] orte_usock_peer_try_connect: Connection across 
> to proc [[50428,0],1] succeeded
> [node39:20515] [[50428,1],9] SEND CONNECT ACK
> [node39:20515] [[50428,1],9] send blocking of 232 bytes to socket 21
> [node39:20515] [[50428,1],9] blocking send complete to socket 21
> [node39:20515] [[50428,1],9]:tcp:processing set_peer cmd
> [node39:20515] [[50428,1],9] SET_PEER ADDING PEER [[50428,0],1]
> [node39:20515] [[50428,1],9] set_peer: peer [[50428,0],1] is listening on net 
> 192.168.20.113 port 37147
> [node39:20515] [[50428,1],9]:tcp:processing set_peer cmd
> [node39:20515] [[50428,1],9] set_peer: peer [[50428,0],1] is listening on net 
> 10.1.7.69 port 37147
> [node39:20515] [[50428,1],9] oob:ud:get_addr contact information: 
> ud://181905.60.1
> [node39:20471] [[50428,0],0]:tcp:recv:handler called for peer [[50428,0],1]
> [node39:20471] [[50428,0],0]:tcp:recv:handler CONNECTED
> [node39:20471] [[50428,0],0]:tcp:recv:handler allocate new recv msg
> [node39:20471] [[50428,0],0]:tcp:recv:handler read hdr
> [node39:20471] [[50428,0],0]:tcp:recv:handler allocate data region of size 
> 3697
> [node39:20471] [[50428,0],0] RECVD COMPLETE MESSAGE FROM [[50428,0],1] 
> (ORIGIN [[50428,0],1]) OF 3697 BYTES FOR DEST [[50428,0],0] TAG 2
> [node39:20471] [[50428,0],0] DELIVERING TO RML
> [node39:20512] [[50428,1],6]:usock:recv:handler called for peer [[50428,0],1]
> [node39:20512] [[50428,1],6] RECV CONNECT ACK FROM [[50428,0],1] ON SOCKET 21
> [node39:20512] [[50428,1],6] waiting for connect ack from [[50428,0],1]
> [node39:20512] [[50428,1],6] connect ack received from [[50428,0],1]
> [node39:20512] [[50428,1],6] connect-ack recvd from [[50428,0],1]
> [node39:20512] [[50428,1],6] usock_peer_recv_connect_ack: received unexpected 
> process identifier [[50428,0],0] from [[50428,0],1]
> [node39:20512] [[50428,1],6] usock_peer_close for [[50428,0],1] sd 21 state 
> FAILED
> [node39:20512] [[50428,1],6] UNABLE TO COMPLETE CONNECT ACK WITH [[50428,0],1]
> [node39:20512] [[50428,1],6] usock:lost connection called for peer 
> [[50428,0],1]
> [node39:20471] [[50428,0],0]:tcp:recv:handler called for peer [[50428,0],1]
> [node39:20471] [[50428,0],0]:tcp:recv:handler CONNECTED
> [node39:20471] [[50428,0],0]:tcp:recv:handler allocate new recv msg
> [node39:20471] [[50428,0],0]:tcp:recv:handler read hdr
> [node39:20471] [[50428,0],0]:tcp:recv:handler allocate data region of size 
> 4118
> [node39:20471] [[50428,0],0] RECVD COMPLETE MESSAGE FROM [[50428,0],1] 
> (ORIGIN [[50428,0],1]) OF 4118 BYTES FOR DEST [[50428,0],0] TAG 2
> [node39:20471] [[50428,0],0] DELIVERING TO RML
> [node39:20514] [[50428,1],8] oob:ud:port_recv_start posting 512 message 
> buffers
> 
> 
> thanks,
> 
> G
> 
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
