On Dec 15, 2010, at 10:14 AM, Gilbert Grosdidier wrote:

> Hello Ralph,
>
> Thanks for taking the time to help me.
>
> On Dec 15, 2010, at 4:27 PM, Ralph Castain wrote:
>
>> It would appear that there is something trying to talk to a socket opened by
>> one of your daemons. At a guess, I would bet the problem is that a prior job
>> left a daemon alive that is talking on the same socket.
>
> gg= At first glance, this could be possible, although I found no evidence of
> it when looking for ghost processes of mine on the relevant nodes.
>
>> Are you by chance using static ports for the job?
>
> gg= How could I know that?
> Is there an easy way to work around these static ports?
> Would it prevent the jobs from colliding with ghost jobs/processes as
> suggested below, please?
> I did not spot any info about static ports in the ompi_info output ... ;-)
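For anyone hitting the same question: a quick way to check whether an OOB port
range is configured anywhere. The ompi_info syntax below matches the 1.4
series; the /opt/openmpi prefix is only an example - substitute your actual
install prefix.

    # List the OOB TCP parameters and look for any port settings
    ompi_info --param oob tcp | grep -i port

    # Check the default MCA parameter files (system-wide, then per-user)
    grep -i oob /opt/openmpi/etc/openmpi-mca-params.conf 2>/dev/null
    grep -i oob $HOME/.openmpi/mca-params.conf 2>/dev/null

    # Check for OOB settings exported through the environment (e.g. by PBS)
    env | grep OMPI_MCA_oob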
It wouldn't happen by default - you would have had to tell us to use static
ports by specifying an OOB port range. If you didn't do that (and remember, it
could be in a default MCA param file!), then the ports are dynamically
assigned.

>> Did you run another job just before this one that might have left a daemon
>> somewhere?
>
> gg= Again, it is possible that, with my many jobs crashing across the
> cluster, PBS was unable to clean up the nodes in time before restarting a
> new one. But I have no evidence.
>
> The exact full error message was like this:
> [r36i3n15:18992] [[1468,0],254]-[[1468,0],14]
> mca_oob_tcp_peer_recv_connect_ack: received unexpected process identifier
> [[1468,1],1643]
>
> From some debug info I got, process 1468 seems to relate to node rank 0
> (r33i0n0), while process 1643 seems to originate from node r36i0n14.

The "1468,1" is an arbitrary identifier for the overall job. The "1643"
indicates that it is an MPI process (rank=1643) within that job that provided
the bad identifier. The "1468,0" identifiers in the early part of the message
indicate that the error occurred on a port being used by two ORTE daemons for
communication. Somehow, an MPI process (rank=1643) injected a message into
that link.

It looks like all the messages are flowing within a single job (all three
processes mentioned in the error have the same identifier). The only
possibility I can think of is that somehow you are reusing ports - is it
possible your system doesn't have enough ports to support all the procs?

I confess I'm a little at a loss - I've never seen this problem before, and we
run on very large clusters.

> But, indeed, none of r33i0n0, r36i0n14 or r36i3n15 exhibits any process like
> 1468 or 1643, while process 18992 is indeed the master one on r36i3n15.
>
> Thanks, Best, G.
>
>> On Dec 15, 2010, at 1:05 AM, Gilbert Grosdidier wrote:
>>
>>> Hello,
>>>
>>> Running with OpenMPI 1.4.3 on an SGI Altix cluster with 4096 cores, I got
>>> this error message right at startup:
>>> mca_oob_tcp_peer_recv_connect_ack: received unexpected process identifier
>>> [[13816,0],209]
>>>
>>> and the whole job then spins for an undefined period, without
>>> crashing/aborting.
>>>
>>> What could be the culprit, please?
>>> Is there a workaround?
>>> Which parameter is to be tuned?
>>>
>>> Thanks in advance for any help, Best, G.
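Following up on the port-exhaustion theory above: a few generic Linux checks
(not specific to this thread) that could confirm or rule it out on the
affected nodes. "orted" is the ORTE daemon that mpirun launches on each node.

    # The ephemeral port range that dynamically-assigned sockets draw from
    cat /proc/sys/net/ipv4/ip_local_port_range

    # Count sockets per state; a flood of TIME_WAIT entries left by a
    # just-crashed job counts against that range until they age out
    netstat -tan | awk 'NR>2 {print $6}' | sort | uniq -c

    # Look for leftover ORTE daemons still holding ports from a prior job
    # (the [o] bracket trick keeps grep from matching its own process)
    ps -ef | grep '[o]rted'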