Re: [OMPI users] Connection to HNP lost

2007-07-10 Thread Ralph Castain
On 7/10/07 3:56 PM, "Glenn Carver" wrote: > Brian, Ralph, > > I neglected to mention in my first email that the application hasn't > completed when I see the "HNP lost" messages. All processes of the > application are still running on the nodes (well consuming cpu cycles > really). I should c

Re: [OMPI users] Connection to HNP lost

2007-07-10 Thread Glenn Carver
Brian, Ralph, I neglected to mention in my first email that the application hasn't completed when I see the "HNP lost" messages. All processes of the application are still running on the nodes (well consuming cpu cycles really). I should check to see if mpirun is still there. Further invest

Re: [OMPI users] Connection to HNP lost

2007-07-10 Thread Brian Barrett
What Ralph said is generally true. If your application completed, this is nothing to worry about. It means that an error occurred on the socket between mpirun and some other process. However, combined with the travor0 errors in the log files, it could mean that your IPoIB network is acting

Re: [OMPI users] Connection to HNP lost

2007-07-10 Thread Ralph H Castain
On 7/10/07 11:08 AM, "Glenn Carver" wrote: > Hi, > > I'd be grateful if someone could explain the meaning of this error > message to me and whether it indicates a hardware problem or > application software issue: > > [node2:11881] OOB: Connection to HNP lost > [node1:09876] OOB: Connection t