On Dec 15, 2010, at 10:14 AM, Gilbert Grosdidier wrote:

> Hello Ralph,
>
> Thanks for taking the time to help me.
>
> On Dec 15, 2010, at 4:27 PM, Ralph Castain wrote:
>
>> It would appear that there is something trying to talk to a socket opened by
>> one of your daemons. At a guess, I would bet the problem is that a prior job
>> left a daemon alive that is talking on the same socket.
>
> gg= At first glance, this could be possible, although I found no evidence of
> it when looking for ghost processes of mine on the relevant nodes.
>
>> Are you by chance using static ports for the job?
>
> gg= How could I know that?
> Is there an easy way to work around these static ports?
> Would it prevent the jobs from colliding with ghost jobs/processes as
> suggested below, please?
> I did not spot any info about static ports in the ompi_info output ... ;-)
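For anyone hitting the same question: a quick way to check whether an OOB port
range is configured anywhere. The ompi_info syntax below matches the 1.4
series; the /opt/openmpi prefix is only an example - substitute your actual
install prefix.

    # List the OOB TCP parameters and look for any port settings
    ompi_info --param oob tcp | grep -i port

    # Check the default MCA parameter files (system-wide, then per-user)
    grep -i oob /opt/openmpi/etc/openmpi-mca-params.conf 2>/dev/null
    grep -i oob $HOME/.openmpi/mca-params.conf 2>/dev/null

    # Check for OOB settings exported through the environment (e.g. by PBS)
    env | grep OMPI_MCA_oob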
It wouldn't happen by default - you would have had to tell us to use static
ports by specifying an OOB port range. If you didn't do that (and remember, it
could be in a default MCA param file!), then the ports are dynamically
assigned.

>> Did you run another job just before this one that might have left a daemon
>> somewhere?
>
> gg= Again, it is possible that, with my many jobs crashing across the
> cluster, PBS was unable to clean up the nodes in time before restarting a
> new one. But I have no evidence.
>
> The exact full error message was like this:
> [r36i3n15:18992] [[1468,0],254]-[[1468,0],14]
> mca_oob_tcp_peer_recv_connect_ack: received unexpected process identifier
> [[1468,1],1643]
>
> From some debug info I got, process 1468 seems to relate to node rank 0
> (r33i0n0), while process 1643 seems to originate from node r36i0n14.

The "1468,1" is an arbitrary identifier for the overall job. The "1643"
indicates that it is an MPI process (rank=1643) within that job that provided
the bad identifier. The "1468,0" identifiers in the early part of the message
indicate that the error occurred on a port being used by two ORTE daemons for
communication. Somehow, an MPI process (rank=1643) injected a message into
that link.

It looks like all the messages are flowing within a single job (all three
processes mentioned in the error have the same identifier). The only
possibility I can think of is that somehow you are reusing ports - is it
possible your system doesn't have enough ports to support all the procs?

I confess I'm a little at a loss - I've never seen this problem before, and we
run on very large clusters.

> But, indeed, none of r33i0n0, r36i0n14 or r36i3n15 exhibits any process like
> 1468 or 1643, while process 18992 is indeed the master one on r36i3n15.
>
> Thanks, Best, G.
>
>> On Dec 15, 2010, at 1:05 AM, Gilbert Grosdidier wrote:
>>
>>> Hello,
>>>
>>> Running with OpenMPI 1.4.3 on an SGI Altix cluster with 4096 cores, I got
>>> this error message right at startup:
>>> mca_oob_tcp_peer_recv_connect_ack: received unexpected process identifier
>>> [[13816,0],209]
>>>
>>> and the whole job then spins for an undefined period, without
>>> crashing/aborting.
>>>
>>> What could be the culprit, please?
>>> Is there a workaround?
>>> Which parameter is to be tuned?
>>>
>>> Thanks in advance for any help, Best, G.
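Following up on the port-exhaustion theory above: a few generic Linux checks
(not specific to this thread) that could confirm or rule it out on the
affected nodes. "orted" is the ORTE daemon that mpirun launches on each node.

    # The ephemeral port range that dynamically-assigned sockets draw from
    cat /proc/sys/net/ipv4/ip_local_port_range

    # Count sockets per state; a flood of TIME_WAIT entries left by a
    # just-crashed job counts against that range until they age out
    netstat -tan | awk 'NR>2 {print $6}' | sort | uniq -c

    # Look for leftover ORTE daemons still holding ports from a prior job
    # (the [o] bracket trick keeps grep from matching its own process)
    ps -ef | grep '[o]rted'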