You may try to use the ibdiagnet tool:
http://linux.die.net/man/1/ibdiagnet
The tool is part of OFED (http://www.openfabrics.org/).
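For example (options per the man page above; the expected width/speed values match the 4x / 10 Gbps rate reported further down this thread, so adjust them to your own fabric):

    ibdiagnet                  # full fabric discovery and checks
    ibdiagnet -pm              # also dump per-port error counters
    ibdiagnet -lw 4x -ls 2.5   # flag links below the expected width/speed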
Pasha.
Prentice Bisbal wrote:
Several jobs on my cluster just died with the error below.
Are there any IB/Open MPI diagnostics I should use to diagnose, should I
just reboot the nodes, or should I have the user who submitted these
jobs just increase the retry count/timeout parameters?
[0,1,6][../../../../../ompi/mca/btl/openi
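For reference, the retry count and timeout knobs asked about here can be listed straight from the Open MPI installation (the grep pattern is only illustrative):

    ompi_info --param btl openib | grep -E 'timeout|retry'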
Thanks Pasha!
ibdiagnet reports the following:
-I---
-I- IPoIB Subnets Check
-I---
-I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Port localhost/P1 lid=0x00e2 guid=
On Thu, Mar 05, 2009 at 10:27:27AM +0200, Pavel Shamis (Pasha) wrote:
> > Time to dig up diagnostics tools and look at port statistics.
>
> You may use the ibdiagnet tool for network debugging -
> http://linux.die.net/man/1/ibdiagnet. This tool is part of OFED.
>
> Pasha.
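To look at those port statistics directly, the standard infiniband-diags utilities shipped with OFED should cover it (a sketch, assuming they are in your PATH):

    perfquery         # error/traffic counters for the local port
    perfquery -r      # read the counters, then reset them
    ibcheckerrors     # walk the fabric, report ports with counters above threshold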
On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote:
This *usually* indicates a physical / layer 0 problem in your IB
fabric. You should do a diagnostic on your HCAs, cables, and switches.
Increasing the timeout value should only be necessary on very large IB
fabrics and/or very congested networks.
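A reasonable first pass on each node, before moving on to cables and switches (assuming the usual OFED utilities are installed):

    ibstat            # HCA and port state, link width and rate
    ibv_devinfo       # verbs-level view of the same ports (state, MTU)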
On Mar 4, 2009, at 3:28 PM, Jan Lindheim wrote:
I found several reports on the Open MPI users mailing list from users
who need to bump up the default value for btl_openib_ib_timeout.
We also have some applications on our cluster that have problems
unless we raise this value from the default 10 to 15:
[24426,1],122][btl_openib_component.c:2905:
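For anyone finding this in the archives: the parameter can be raised per job on the mpirun command line (the process count and program name below are placeholders). Note the value is an exponent - the InfiniBand ACK timeout is 4.096 usec * 2^N, so going from 10 to 15 makes it 32 times longer:

    mpirun -np 64 --mca btl_openib_ib_timeout 15 ./my_app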