On Thu, Mar 05, 2009 at 10:27:27AM +0200, Pavel Shamis (Pasha) wrote:
> > Time to dig up diagnostics tools and look at port statistics.
>
> You may use ibdiagnet tool for the network debug -
> http://linux.die.net/man/1/ibdiagnet. This tool is part of OFED.
>
> Pasha.

Thanks Pasha!

ibdiagnet reports the following:

-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Port localhost/P1 lid=0x00e2 guid=0x001e0bffff4ced75 dev=25218 can not join due to rate:2.5Gbps < group:10Gbps

I guess this may indicate a bad adapter. Now, I just need to find what
system this maps to.

I also ran ibcheckerrors and it reports a lot of problems with buffer
overruns. Here's the tail end of the output, with only some of the last
ports reported:

#warn: counter SymbolErrors = 36905 (threshold 10) lid 193 port 14
#warn: counter LinkDowned = 23 (threshold 10) lid 193 port 14
#warn: counter RcvErrors = 15641 (threshold 10) lid 193 port 14
#warn: counter RcvSwRelayErrors = 225 (threshold 100) lid 193 port 14
#warn: counter ExcBufOverrunErrors = 10 (threshold 10) lid 193 port 14
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 14: FAILED
#warn: counter LinkRecovers = 181 (threshold 10) lid 193 port 1
#warn: counter RcvSwRelayErrors = 2417 (threshold 100) lid 193 port 1
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 1: FAILED
#warn: counter LinkRecovers = 103 (threshold 10) lid 193 port 3
#warn: counter RcvErrors = 9035 (threshold 10) lid 193 port 3
#warn: counter RcvSwRelayErrors = 64670 (threshold 100) lid 193 port 3
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 3: FAILED
#warn: counter SymbolErrors = 13151 (threshold 10) lid 193 port 4
#warn: counter RcvErrors = 109 (threshold 10) lid 193 port 4
#warn: counter RcvSwRelayErrors = 507 (threshold 100) lid 193 port 4
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 4: FAILED

## Summary: 209 nodes checked, 0 bad nodes found
##          716 ports checked, 103 ports have errors beyond threshold

I wonder if this is something that needs to be tuned in the Infiniband
switch or if there is something in OpenMPI/OpenIB that can be tuned.

Thanks,
Jan Lindheim
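
P.S. With 103 ports over threshold, it may help to tally the #warn lines
per lid/port instead of reading them by eye. The following is only a rough
sketch, assuming the output format matches the lines quoted above; the
function name summarize_warnings and the SAMPLE text are made up for
illustration and are not part of OFED or any existing tool.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   #warn: counter SymbolErrors = 36905 (threshold 10) lid 193 port 14
# Assumption: every warning follows exactly this layout.
WARN_RE = re.compile(
    r"#warn: counter (\w+) = (\d+)\s+\(threshold (\d+)\)\s+lid (\d+) port (\d+)"
)


def summarize_warnings(text):
    """Return {(lid, port): {counter_name: value}} for each #warn entry."""
    ports = defaultdict(dict)
    for match in WARN_RE.finditer(text):
        counter, value, _threshold, lid, port = match.groups()
        ports[(int(lid), int(port))][counter] = int(value)
    return dict(ports)


# A few lines copied from the ibcheckerrors output above.
SAMPLE = """\
#warn: counter SymbolErrors = 36905 (threshold 10) lid 193 port 14
#warn: counter LinkDowned = 23 (threshold 10) lid 193 port 14
#warn: counter RcvErrors = 15641 (threshold 10) lid 193 port 14
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 14: FAILED
#warn: counter LinkRecovers = 181 (threshold 10) lid 193 port 1
#warn: counter RcvSwRelayErrors = 2417 (threshold 100) lid 193 port 1
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 1: FAILED
"""

if __name__ == "__main__":
    # Print one line per port, ordered by lid and port number.
    for (lid, port), counters in sorted(summarize_warnings(SAMPLE).items()):
        details = ", ".join(f"{name}={val}" for name, val in sorted(counters.items()))
        print(f"lid {lid} port {port}: {details}")
```

Feeding it the saved ibcheckerrors output in place of SAMPLE should give
one line per failing port, which makes it easier to see whether the errors
cluster on particular switch ports or line cards.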