Hello Pasha,

As the error was not occurring frequently, I had not looked into the issue for a long time. But now I have started to diagnose it.

Initially I tested with ibv_rc_pingpong, from the master node to all compute nodes using a for loop, and it works for each of the nodes.
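The loop was along these lines (the node names, iteration count and message size below are placeholders rather than the exact values used):

#!/bin/bash
# quick RC ping-pong check from the master to every compute node
for node in node-0-1 node-0-2 node-0-3; do
    # start the server side of ibv_rc_pingpong on the remote node
    ssh "$node" ibv_rc_pingpong -n 1000 -s 4096 &
    sleep 2
    # connect to it from the master as the client
    if ibv_rc_pingpong -n 1000 -s 4096 "$node"; then
        echo "$node: OK"
    else
        echo "$node: FAILED"
    fi
    wait
done

A short test like this passes everywhere, including node-0-2, although it probably would not catch an intermittent link flap of the kind that only shows up after hours of running.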
The files generated by the command "ibdiagnet -v -r -o ." are attached herewith. ibcheckerrors shows the following warnings:

# ibcheckerrors
#warn: counter RcvSwRelayErrors = 408 (threshold 100)
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter RcvSwRelayErrors = 179 (threshold 100)
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port 7: FAILED
# Checked Switch: nodeguid 0x000b8cffff00551b with failure

## Summary: 25 nodes checked, 0 bad nodes found
## 48 ports checked, 1 ports have errors beyond threshold

Are these messages helpful for finding the issue with node-0-2? Can you please help us diagnose further?
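To see whether RcvSwRelayErrors on that switch port keeps climbing while a job runs, the counter can be read and cleared with perfquery, which comes from the same infiniband-diags package as ibcheckerrors (LID 2 and port 7 below are taken from the output above; the exact option names may vary between OFED releases):

# read the PortCounters of the switch port flagged above (LID 2, port 7)
perfquery 2 7 | grep RcvSwRelayErrors

# reset the counters after reading them; if -R is not available in this
# release, the ibclearerrors script clears error counters fabric-wide
perfquery -R 2 7

# ...let a CPMD job run for a while, then look again...
perfquery 2 7 | grep RcvSwRelayErrors
ibcheckerrors

That said, RcvSwRelayErrors is also known to increment for fairly harmless reasons (for example multicast forwarding decisions), so it may not by itself explain the node-0-2 failures.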
Thanks,
Sangamesh

On Mon, Sep 21, 2009 at 1:36 PM, Pavel Shamis (Pasha) <pash...@gmail.com> wrote:

> Sangamesh,
>
> The IB tunings that you added to your command line only delay the problem,
> they do not resolve it.
> node-0-2.local gets the asynchronous event "IBV_EVENT_PORT_ERROR"; as a
> result the processes fail to deliver packets to some remote hosts, and you
> see a bunch of IB errors.
>
> The IBV_EVENT_PORT_ERROR event means that the IB port went from the ACTIVE
> state to the DOWN state.
> In other words, you have a problem with your IB network that causes all
> these network errors.
> The root cause of such an issue may be a bad cable or a problematic port
> on the switch.
>
> For IB network debugging I propose you use ibdiagnet, an open source IB
> network diagnostic tool:
> http://linux.die.net/man/1/ibdiagnet
> The tool is part of the OFED distribution.
>
> Pasha.
>
>
> Sangamesh B wrote:
>
>> Dear all,
>>
>> The CPMD application, compiled with Open MPI 1.3 (Intel 10.1 compilers)
>> on CentOS 4.5, fails only when a specific node, node-0-2, is involved,
>> but runs well on the other nodes.
>>
>> Initially the job failed after 5-10 minutes (on node-0-2 + some other
>> nodes). After googling the error, I added the options "-mca
>> btl_openib_ib_min_rnr_timer 25 -mca btl_openib_ib_timeout 20" to the
>> mpirun command in the SGE script:
>>
>> $ cat cpmdrun.sh
>> #!/bin/bash
>> #$ -N cpmd-acw
>> #$ -S /bin/bash
>> #$ -cwd
>> #$ -e err.$JOB_ID.$JOB_NAME
>> #$ -o out.$JOB_ID.$JOB_NAME
>> #$ -pe ib 32
>> unset SGE_ROOT
>> PP_LIBRARY=/home/user1/cpmdrun/wac/prod/PP
>> CPMD=/opt/apps/cpmd/3.11/ompi/SOURCE/cpmd311-ompi-mkl.x
>> MPIRUN=/opt/mpi/openmpi/1.3/intel/bin/mpirun
>> $MPIRUN -np $NSLOTS -hostfile $TMPDIR/machines -mca btl_openib_ib_min_rnr_timer 25 -mca btl_openib_ib_timeout 20 $CPMD wac_md26.in $PP_LIBRARY > wac_md26.out
>>
>> After adding these options, the job executed for 24+ hours and then
>> failed with the same error as earlier. The error is:
>>
>> $ cat err.6186.cpmd-acw
>> --------------------------------------------------------------------------
>> The OpenFabrics stack has reported a network error event. Open MPI
>> will try to continue, but your job may end up failing.
>>
>> Local host: node-0-2.local
>> MPI process PID: 11840
>> Error number: 10 (IBV_EVENT_PORT_ERR)
>>
>> This error may indicate connectivity problems within the fabric;
>> please contact your system administrator.
>> --------------------------------------------------------------------------
>> [node-0-2.local:11836] 7 more processes have sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>> [node-0-2.local:11836] 1 more process has sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 7 more processes have sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 1 more process has sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 7 more processes have sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 1 more process has sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 7 more processes have sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 1 more process has sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 7 more processes have sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 1 more process has sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 15 more processes have sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 16 more processes have sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 16 more processes have sent help message help-mpi-btl-openib.txt / of error event
>> [[718,1],20][btl_openib_component.c:2902:handle_wc] from node-0-22.local to: node-0-2
>> --------------------------------------------------------------------------
>> The InfiniBand retry count between two MPI processes has been
>> exceeded. "Retry count" is defined in the InfiniBand spec 1.2
>> (section 12.7.38):
>>
>>     The total number of times that the sender wishes the receiver to
>>     retry timeout, packet sequence, etc. errors before posting a
>>     completion error.
>>
>> This error typically means that there is something awry within the
>> InfiniBand fabric itself. You should note the hosts on which this
>> error has occurred; it has been observed that rebooting or removing a
>> particular host from the job can sometimes resolve this issue.
>>
>> Two MCA parameters can be used to control Open MPI's behavior with
>> respect to the retry count:
>>
>> * btl_openib_ib_retry_count - The number of times the sender will
>>   attempt to retry (defaulted to 7, the maximum value).
>> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>>   to 10). The actual timeout value used is calculated as:
>>
>>     4.096 microseconds * (2^btl_openib_ib_timeout)
>>
>> See the InfiniBand spec 1.2 (section 12.7.34) for more details.
>>
>> Below is some information about the host that raised the error and the
>> peer to which it was connected:
>>
>> Local host: node-0-22.local
>> Local device: mthca0
>> Peer host: node-0-2
>>
>> You may need to consult with your system administrator to get this
>> problem fixed.
>> --------------------------------------------------------------------------
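For reference, plugging the two btl_openib_ib_timeout values from this thread (the default 10 and the 20 used on the mpirun line) into that formula gives the per-retry ACK timeout; a quick way to evaluate it from the shell, with awk used purely as a calculator:

awk 'BEGIN {
    # 4.096 microseconds * 2^btl_openib_ib_timeout
    for (t = 10; t <= 20; t += 10)
        printf "btl_openib_ib_timeout = %d  ->  %.4f s per retry\n", t, 4.096e-6 * 2^t
}'

So each stalled send waits roughly 4.3 s per retry with the value 20, against about 4 ms with the default, which helps explain why the tuned job runs much longer before RETRY EXCEEDED finally appears.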
>> error polling LP CQ with status RETRY EXCEEDED ERROR status number 12
>> for wr_id 66384128 opcode 128 qp_idx 3
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 20 with PID 10425 on
>> node ibc22 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>> rm: cannot remove `/tmp/6186.1.iblong.q/rsh': No such file or directory
>>
>> The openibd service is running fine:
>>
>> $ service openibd status
>> HCA driver loaded
>> Configured devices:
>> ib0
>> Currently active devices:
>> ib0
>>
>> The following OFED modules are loaded:
>> rdma_ucm
>> ib_sdp
>> rdma_cm
>> ib_addr
>> ib_ipoib
>> mlx4_core
>> mlx4_ib
>> ib_mthca
>> ib_uverbs
>> ib_umad
>> ib_ucm
>> ib_sa
>> ib_cm
>> ib_mad
>> ib_core
>>
>> But the job is still failing after hours of running, and only when this
>> particular node is involved. What is wrong with node-0-2? How can it be
>> resolved?
>>
>> Thanks,
>> Sangamesh
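Since IBV_EVENT_PORT_ERR means the HCA port on node-0-2 left the ACTIVE state at some point during the run, one more check that may help is to log the port state on that node over time, so that a link flap can be matched against the moment the job dies. A minimal sketch, assuming the ibstat tool from the same infiniband-diags package (the log path and polling interval are arbitrary):

# run on node-0-2 for the duration of a long CPMD job
while true; do
    date '+%F %T'
    ibstat | grep -E 'State|Physical state|Rate'
    sleep 60
done >> /tmp/ibstate-node-0-2.log

If the state flips from Active/LinkUp to Down or Polling around the failure time, that points at the cable or the switch port for node-0-2, as Pasha suggests.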
IBtest_ibdiagnet.tar.gz
Description: GNU Zip compressed data