What version of OMPI are you using? The job should terminate in either case - 
what did you do to keep it running after a node failure with tcp?


On Sep 23, 2011, at 12:34 PM, Guilherme V wrote:

> Hi,
> I want to know if anybody else is having problems with fault-tolerant jobs 
> over infiniband. When I run my job with tcp and something happens to one 
> node, my job keeps running, but if I switch the job to infiniband and 
> anything happens to the infiniband (e.g. cable problems), my job fails.
> 
> Does anybody know whether something different needs to be done when using 
> openib instead of tcp?
> 
> Below is an example of the message I'm receiving from MPI.
> 
> Regards,
> Guilherme
> 
> --------------------------------------------------------------------------
> The OpenFabrics stack has reported a network error event.  Open MPI
> will try to continue, but your job may end up failing.
> 
>   Local host:        XXXXX
>   MPI process PID:   23341                         
>   Error number:      10 (IBV_EVENT_PORT_ERR)       
> 
> This error may indicate connectivity problems within the fabric;
> please contact your system administrator.                       
> --------------------------------------------------------------------------
> [ZZZZ:23320] 15 more processes have sent help message help-mpi-btl-openib.txt 
> / of error event
> [WWW:23320] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help 
> / error messages
> [[4089,1],144][btl_openib_component.c:3227:handle_wc] from XXXXX to: YYYYY 
> error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for 
> wr_id 214283560 opcode 51  vendor error 129 qp_idx 3
> [[4089,1],147][btl_openib_component.c:3227:handle_wc] from XXXXX to: YYYYY 
> error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for 
> wr_id 490884096 opcode 1  vendor error 129 qp_idx 0
> --------------------------------------------------------------------------
> The InfiniBand retry count between two MPI processes has been
> exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
> (section 12.7.38):
> 
>     The total number of times that the sender wishes the receiver to
>     retry timeout, packet sequence, etc. errors before posting a
>     completion error.
> 
> This error typically means that there is something awry within the
> InfiniBand fabric itself.  You should note the hosts on which this
> error has occurred; it has been observed that rebooting or removing a
> particular host from the job can sometimes resolve this issue.
> 
> Two MCA parameters can be used to control Open MPI's behavior with
> respect to the retry count:
> 
> * btl_openib_ib_retry_count - The number of times the sender will
>   attempt to retry (defaulted to 7, the maximum value).
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>   to 10).  The actual timeout value used is calculated as:
> 
>      4.096 microseconds * (2^btl_openib_ib_timeout)
> 
>   See the InfiniBand spec 1.2 (section 12.7.34) for more details.
> 
> Below is some information about the host that raised the error and the
> peer to which it was connected:
> 
>   Local host:   XXXX
>   Local device: mlx4_0
>   Peer host:    YYYY
> 
> You may need to consult with your system administrator to get this
> problem fixed.
> --------------------------------------------------------------------------
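
For what it's worth, the two MCA parameters described above can be passed on
the mpirun command line or exported as environment variables; the values, the
process count, and the application name below are only placeholders, not
recommended settings:

    # A local ACK timeout of 20 means 4.096 us * 2^20, roughly 4.3 s per retry.
    mpirun --mca btl_openib_ib_timeout 20 \
           --mca btl_openib_ib_retry_count 7 \
           -np 16 ./my_app

    # Equivalent form using environment variables:
    export OMPI_MCA_btl_openib_ib_timeout=20
    export OMPI_MCA_btl_openib_ib_retry_count=7

Raising these only gives the fabric more time before a RETRY EXCEEDED error is
reported; it does not by itself make a job survive a dead link.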
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

