On 7/10/07 3:56 PM, "Glenn Carver" wrote:
> Brian, Ralph,
>
> I neglected to mention in my first email that the application hasn't
> completed when I see the "HNP lost" messages. All processes of the
> application are still running on the nodes (well consuming cpu cycles
> really). I should c
Brian, Ralph,
I neglected to mention in my first email that the application hasn't
completed when I see the "HNP lost" messages. All processes of the
application are still running on the nodes (well consuming cpu cycles
really). I should check to see if mpirun is still there.
Further invest
What Ralph said is generally true. If your application completed,
this is nothing to worry about. It means that an error occurred on
the socket between mpirun and some other process. However, combined
with the travor0 errors in the log files, it could mean that your
IPoIB network is acting
On 7/10/07 11:08 AM, "Glenn Carver" wrote:
> Hi,
>
> I'd be grateful if someone could explain the meaning of this error
> message to me and whether it indicates a hardware problem or
> application software issue:
>
> [node2:11881] OOB: Connection to HNP lost
> [node1:09876] OOB: Connection t