Put "oob=^usock" in your default MCA param file, or add OMPI_MCA_oob=^usock to your environment.
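
As a rough sketch of the two options above, assuming a bash-like shell and the per-user parameter file at $HOME/.openmpi/mca-params.conf (a system-wide install would normally use etc/openmpi-mca-params.conf under the Open MPI prefix instead). The MCA_PARAM_FILE variable is only a placeholder for this example, not something Open MPI reads:

```shell
# Option 1: exclude the usock OOB component via the per-user MCA param file.
# MCA_PARAM_FILE is a hypothetical helper variable for this sketch only;
# adjust the path if your site uses a different default param file.
MCA_PARAM_FILE="${MCA_PARAM_FILE:-$HOME/.openmpi/mca-params.conf}"
mkdir -p "$(dirname "$MCA_PARAM_FILE")"

# Append the setting only if that exact line is not already present.
grep -qxF 'oob = ^usock' "$MCA_PARAM_FILE" 2>/dev/null \
  || echo 'oob = ^usock' >> "$MCA_PARAM_FILE"

# Option 2: set it through the environment instead (quoted so that no
# shell treats the caret specially).
export OMPI_MCA_oob='^usock'

# A one-off alternative is to pass it on the command line
# (./my_app is a placeholder for your own binary):
#   mpirun --mca oob ^usock ./my_app
```

Either option alone should be enough. When jobs are launched through Slurm, the environment variable has to be set in the job's environment, for example in the batch script.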

> On Aug 11, 2018, at 5:54 AM, Kapetanakis Giannis <bil...@edu.physics.uoc.gr> wrote:
>
> Hi,
>
> I'm struggling to get 2.1.x to work with our HPC.
>
> Versions 1.8.8 and 3.x work fine.
>
> In 2.1.3 and 2.1.4 I get errors and segmentation faults. The builds have
> InfiniBand and Slurm support.
> Running mpirun locally works fine. Any help debugging this?
>
> [node39:20090] [[50526,1],2] usock_peer_recv_connect_ack: received unexpected process identifier [[50526,0],0] from [[50526,0],1]
> [node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20088] [[50526,1],0] usock_peer_recv_connect_ack: received unexpected process identifier [[50526,0],0] from [[50526,0],1]
> [node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20096] [[50526,1],8] usock_peer_recv_connect_ack: received unexpected process identifier [[50526,0],0] from [[50526,0],1]
> [node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],8] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],8] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],8] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],8] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],6] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],8] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],6] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20094] [[50526,1],6] usock_peer_recv_connect_ack: received unexpected process identifier [[50526,0],0] from [[50526,0],1]
> [node39:20053] [[50526,0],0]-[[50526,1],2] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],0] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],8] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20053] [[50526,0],0]-[[50526,1],6] mca_oob_usock_peer_recv_handler: invalid socket state(1)
> [node39:20097] [[50526,1],9] usock_peer_recv_connect_ack: received unexpected process identifier [[50526,0],0] from [[50526,0],1]
> [node39:20092] [[50526,1],4] usock_peer_recv_connect_ack: received unexpected process identifier [[50526,0],0] from [[50526,0],1]
>
> Part of the debug output:
>
> [node39:20515] mca:oob:select: Inserting component
> [node39:20515] mca:oob:select: Found 3 active transports
> [node39:20515] [[50428,1],9]: set_addr to uri 3304849408.1;usock;tcp://192.168.20.113,10.1.7.69:37147;ud://181895.60.1
> [node39:20515] [[50428,1],9]:set_addr checking if peer [[50428,0],1] is reachable via component usock
> [node39:20515] [[50428,1],9]:[oob_usock_component.c:349] connect to [[50428,0],1]
> [node39:20515] [[50428,1],9]: peer [[50428,0],1] is reachable via component usock
> [node39:20515] [[50428,1],9]:set_addr checking if peer [[50428,0],1] is reachable via component tcp
> [node39:20515] [[50428,1],9] oob:tcp: ignoring address usock
> [node39:20515] [[50428,1],9] oob:tcp: working peer [[50428,0],1] address tcp://192.168.20.113,10.1.7.69:37147
> [node39:20515] [[50428,1],9] PASSING ADDR 192.168.20.113 TO MODULE
> [node39:20515] [[50428,1],9]:tcp set addr for peer [[50428,0],1]
> [node39:20515] [[50428,1],9] PASSING ADDR 10.1.7.69 TO MODULE
> [node39:20515] [[50428,1],9]:tcp set addr for peer [[50428,0],1]
> [node39:20515] [[50428,1],9] oob:tcp: ignoring address ud://181895.60.1
> [node39:20515] [[50428,1],9]: peer [[50428,0],1] is reachable via component tcp
> [node39:20515] [[50428,1],9]:set_addr checking if peer [[50428,0],1] is reachable via component ud
> [node39:20515] [[50428,1],9] oob:ud:set_addr: setting location for peer [[50428,0],1] from ud://181895.60.1
> [node39:20515] [[50428,1],9]: peer [[50428,0],1] is reachable via component ud
> [node39:20515] [[50428,1],9] orte_usock_peer_try_connect: attempting to connect to proc [[50428,0],1]
> [node39:20515] [[50428,1],9] orte_usock_peer_try_connect: attempting to connect to proc [[50428,0],1] on socket 21
> [node39:20515] [[50428,1],9] orte_usock_peer_try_connect: attempting to connect to proc [[50428,0],1] - 0 retries
> [node39:20515] [[50428,1],9] orte_usock_peer_try_connect: Connection across to proc [[50428,0],1] succeeded
> [node39:20515] [[50428,1],9] SEND CONNECT ACK
> [node39:20515] [[50428,1],9] send blocking of 232 bytes to socket 21
> [node39:20515] [[50428,1],9] blocking send complete to socket 21
> [node39:20515] [[50428,1],9]:tcp:processing set_peer cmd
> [node39:20515] [[50428,1],9] SET_PEER ADDING PEER [[50428,0],1]
> [node39:20515] [[50428,1],9] set_peer: peer [[50428,0],1] is listening on net 192.168.20.113 port 37147
> [node39:20515] [[50428,1],9]:tcp:processing set_peer cmd
> [node39:20515] [[50428,1],9] set_peer: peer [[50428,0],1] is listening on net 10.1.7.69 port 37147
> [node39:20515] [[50428,1],9] oob:ud:get_addr contact information: ud://181905.60.1
> [node39:20471] [[50428,0],0]:tcp:recv:handler called for peer [[50428,0],1]
> [node39:20471] [[50428,0],0]:tcp:recv:handler CONNECTED
> [node39:20471] [[50428,0],0]:tcp:recv:handler allocate new recv msg
> [node39:20471] [[50428,0],0]:tcp:recv:handler read hdr
> [node39:20471] [[50428,0],0]:tcp:recv:handler allocate data region of size 3697
> [node39:20471] [[50428,0],0] RECVD COMPLETE MESSAGE FROM [[50428,0],1] (ORIGIN [[50428,0],1]) OF 3697 BYTES FOR DEST [[50428,0],0] TAG 2
> [node39:20471] [[50428,0],0] DELIVERING TO RML
> [node39:20512] [[50428,1],6]:usock:recv:handler called for peer [[50428,0],1]
> [node39:20512] [[50428,1],6] RECV CONNECT ACK FROM [[50428,0],1] ON SOCKET 21
> [node39:20512] [[50428,1],6] waiting for connect ack from [[50428,0],1]
> [node39:20512] [[50428,1],6] connect ack received from [[50428,0],1]
> [node39:20512] [[50428,1],6] connect-ack recvd from [[50428,0],1]
> [node39:20512] [[50428,1],6] usock_peer_recv_connect_ack: received unexpected process identifier [[50428,0],0] from [[50428,0],1]
> [node39:20512] [[50428,1],6] usock_peer_close for [[50428,0],1] sd 21 state FAILED
> [node39:20512] [[50428,1],6] UNABLE TO COMPLETE CONNECT ACK WITH [[50428,0],1]
> [node39:20512] [[50428,1],6] usock:lost connection called for peer [[50428,0],1]
> [node39:20471] [[50428,0],0]:tcp:recv:handler called for peer [[50428,0],1]
> [node39:20471] [[50428,0],0]:tcp:recv:handler CONNECTED
> [node39:20471] [[50428,0],0]:tcp:recv:handler allocate new recv msg
> [node39:20471] [[50428,0],0]:tcp:recv:handler read hdr
> [node39:20471] [[50428,0],0]:tcp:recv:handler allocate data region of size 4118
> [node39:20471] [[50428,0],0] RECVD COMPLETE MESSAGE FROM [[50428,0],1] (ORIGIN [[50428,0],1]) OF 4118 BYTES FOR DEST [[50428,0],0] TAG 2
> [node39:20471] [[50428,0],0] DELIVERING TO RML
> [node39:20514] [[50428,1],8] oob:ud:port_recv_start posting 512 message buffers
>
> thanks,
>
> G
>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users