Hello Pasha,

As the error was not occurring frequently, I had not looked into the issue for a long time. But now I have started to diagnose it.

Initially I tested with ibv_rc_pingpong, from the master node to all compute nodes using a for loop, and it works for each of the nodes.
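The loop was along these lines (the node names, iteration count and message size below are placeholders rather than the exact values used):

#!/bin/bash
# quick RC ping-pong check from the master to every compute node
for node in node-0-1 node-0-2 node-0-3; do
    # start the server side of ibv_rc_pingpong on the remote node
    ssh "$node" ibv_rc_pingpong -n 1000 -s 4096 &
    sleep 2
    # connect to it from the master as the client
    if ibv_rc_pingpong -n 1000 -s 4096 "$node"; then
        echo "$node: OK"
    else
        echo "$node: FAILED"
    fi
    wait
done

A short test like this passes everywhere, including node-0-2, although it probably would not catch an intermittent link flap of the kind that only shows up after hours of running.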
The files generated by the command "ibdiagnet -v -r -o ." are attached herewith. ibcheckerrors shows the following warnings:

# ibcheckerrors
#warn: counter RcvSwRelayErrors = 408 (threshold 100)
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter RcvSwRelayErrors = 179 (threshold 100)
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port 7: FAILED
# Checked Switch: nodeguid 0x000b8cffff00551b with failure

## Summary: 25 nodes checked, 0 bad nodes found
## 48 ports checked, 1 ports have errors beyond threshold

Are these messages helpful for finding the issue with node-0-2? Can you please help us diagnose further?
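To see whether RcvSwRelayErrors on that switch port keeps climbing while a job runs, the counter can be read and cleared with perfquery, which comes from the same infiniband-diags package as ibcheckerrors (LID 2 and port 7 below are taken from the output above; the exact option names may vary between OFED releases):

# read the PortCounters of the switch port flagged above (LID 2, port 7)
perfquery 2 7 | grep RcvSwRelayErrors

# reset the counters after reading them; if -R is not available in this
# release, the ibclearerrors script clears error counters fabric-wide
perfquery -R 2 7

# ...let a CPMD job run for a while, then look again...
perfquery 2 7 | grep RcvSwRelayErrors
ibcheckerrors

That said, RcvSwRelayErrors is also known to increment for fairly harmless reasons (for example multicast forwarding decisions), so it may not by itself explain the node-0-2 failures.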
Thanks,
Sangamesh

On Mon, Sep 21, 2009 at 1:36 PM, Pavel Shamis (Pasha) <pash...@gmail.com> wrote:

> Sangamesh,
>
> The IB tunings that you added to your command line only delay the problem,
> they do not resolve it.
> node-0-2.local gets the asynchronous event "IBV_EVENT_PORT_ERROR"; as a
> result the processes fail to deliver packets to some remote hosts, and you
> see a bunch of IB errors.
>
> The IBV_EVENT_PORT_ERROR event means that the IB port went from the ACTIVE
> state to the DOWN state.
> In other words, you have a problem with your IB network that causes all
> these network errors.
> The root cause of such an issue may be a bad cable or a problematic port
> on the switch.
>
> For IB network debugging I propose you use ibdiagnet, an open source IB
> network diagnostic tool:
> http://linux.die.net/man/1/ibdiagnet
> The tool is part of the OFED distribution.
>
> Pasha.
>
>
> Sangamesh B wrote:
>
>> Dear all,
>>
>> The CPMD application, compiled with Open MPI 1.3 (Intel 10.1 compilers)
>> on CentOS 4.5, fails only when a specific node, node-0-2, is involved,
>> but runs well on the other nodes.
>>
>> Initially the job failed after 5-10 minutes (on node-0-2 + some other
>> nodes). After googling the error, I added the options "-mca
>> btl_openib_ib_min_rnr_timer 25 -mca btl_openib_ib_timeout 20" to the
>> mpirun command in the SGE script:
>>
>> $ cat cpmdrun.sh
>> #!/bin/bash
>> #$ -N cpmd-acw
>> #$ -S /bin/bash
>> #$ -cwd
>> #$ -e err.$JOB_ID.$JOB_NAME
>> #$ -o out.$JOB_ID.$JOB_NAME
>> #$ -pe ib 32
>> unset SGE_ROOT
>> PP_LIBRARY=/home/user1/cpmdrun/wac/prod/PP
>> CPMD=/opt/apps/cpmd/3.11/ompi/SOURCE/cpmd311-ompi-mkl.x
>> MPIRUN=/opt/mpi/openmpi/1.3/intel/bin/mpirun
>> $MPIRUN -np $NSLOTS -hostfile $TMPDIR/machines -mca btl_openib_ib_min_rnr_timer 25 -mca btl_openib_ib_timeout 20 $CPMD wac_md26.in $PP_LIBRARY > wac_md26.out
>>
>> After adding these options, the job executed for 24+ hours and then
>> failed with the same error as earlier. The error is:
>>
>> $ cat err.6186.cpmd-acw
>> --------------------------------------------------------------------------
>> The OpenFabrics stack has reported a network error event. Open MPI
>> will try to continue, but your job may end up failing.
>>
>> Local host: node-0-2.local
>> MPI process PID: 11840
>> Error number: 10 (IBV_EVENT_PORT_ERR)
>>
>> This error may indicate connectivity problems within the fabric;
>> please contact your system administrator.
>> --------------------------------------------------------------------------
>> [node-0-2.local:11836] 7 more processes have sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>> [node-0-2.local:11836] 1 more process has sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 7 more processes have sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 1 more process has sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 7 more processes have sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 1 more process has sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 7 more processes have sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 1 more process has sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 7 more processes have sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 1 more process has sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 15 more processes have sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 16 more processes have sent help message help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 16 more processes have sent help message help-mpi-btl-openib.txt / of error event
>> [[718,1],20][btl_openib_component.c:2902:handle_wc] from node-0-22.local to: node-0-2
>> --------------------------------------------------------------------------
>> The InfiniBand retry count between two MPI processes has been
>> exceeded. "Retry count" is defined in the InfiniBand spec 1.2
>> (section 12.7.38):
>>
>>     The total number of times that the sender wishes the receiver to
>>     retry timeout, packet sequence, etc. errors before posting a
>>     completion error.
>>
>> This error typically means that there is something awry within the
>> InfiniBand fabric itself. You should note the hosts on which this
>> error has occurred; it has been observed that rebooting or removing a
>> particular host from the job can sometimes resolve this issue.
>>
>> Two MCA parameters can be used to control Open MPI's behavior with
>> respect to the retry count:
>>
>> * btl_openib_ib_retry_count - The number of times the sender will
>>   attempt to retry (defaulted to 7, the maximum value).
>> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>>   to 10). The actual timeout value used is calculated as:
>>
>>     4.096 microseconds * (2^btl_openib_ib_timeout)
>>
>> See the InfiniBand spec 1.2 (section 12.7.34) for more details.
>>
>> Below is some information about the host that raised the error and the
>> peer to which it was connected:
>>
>> Local host: node-0-22.local
>> Local device: mthca0
>> Peer host: node-0-2
>>
>> You may need to consult with your system administrator to get this
>> problem fixed.
>> --------------------------------------------------------------------------
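For reference, plugging the two btl_openib_ib_timeout values from this thread (the default 10 and the 20 used on the mpirun line) into that formula gives the per-retry ACK timeout; a quick way to evaluate it from the shell, with awk used purely as a calculator:

awk 'BEGIN {
    # 4.096 microseconds * 2^btl_openib_ib_timeout
    for (t = 10; t <= 20; t += 10)
        printf "btl_openib_ib_timeout = %d  ->  %.4f s per retry\n", t, 4.096e-6 * 2^t
}'

So each stalled send waits roughly 4.3 s per retry with the value 20, against about 4 ms with the default, which helps explain why the tuned job runs much longer before RETRY EXCEEDED finally appears.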
>> error polling LP CQ with status RETRY EXCEEDED ERROR status number 12
>> for wr_id 66384128 opcode 128 qp_idx 3
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 20 with PID 10425 on
>> node ibc22 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>> rm: cannot remove `/tmp/6186.1.iblong.q/rsh': No such file or directory
>>
>> The openibd service is running fine:
>>
>> $ service openibd status
>> HCA driver loaded
>> Configured devices:
>> ib0
>> Currently active devices:
>> ib0
>>
>> The following OFED modules are loaded:
>> rdma_ucm
>> ib_sdp
>> rdma_cm
>> ib_addr
>> ib_ipoib
>> mlx4_core
>> mlx4_ib
>> ib_mthca
>> ib_uverbs
>> ib_umad
>> ib_ucm
>> ib_sa
>> ib_cm
>> ib_mad
>> ib_core
>>
>> But the job is still failing after hours of running, and only when this
>> particular node is involved. What is wrong with node-0-2? How can it be
>> resolved?
>>
>> Thanks,
>> Sangamesh
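Since IBV_EVENT_PORT_ERR means the HCA port on node-0-2 left the ACTIVE state at some point during the run, one more check that may help is to log the port state on that node over time, so that a link flap can be matched against the moment the job dies. A minimal sketch, assuming the ibstat tool from the same infiniband-diags package (the log path and polling interval are arbitrary):

# run on node-0-2 for the duration of a long CPMD job
while true; do
    date '+%F %T'
    ibstat | grep -E 'State|Physical state|Rate'
    sleep 60
done >> /tmp/ibstate-node-0-2.log

If the state flips from Active/LinkUp to Down or Polling around the failure time, that points at the cable or the switch port for node-0-2, as Pasha suggests.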
IBtest_ibdiagnet.tar.gz
Description: GNU Zip compressed data