This typically indicates an error in the physical layer of your IB network.  
You should run layer 0 diagnostics and look for bad cables, bad HCAs, etc.
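
For reference, the help text quoted below gives the local ACK timeout as
4.096 microseconds * (2^btl_openib_ib_timeout), so the default of 10 works
out to roughly 4.19 milliseconds. A minimal Python sketch of that arithmetic
follows; the function name ack_timeout_us and the sample values other than
10 are illustrative choices, not anything taken from Open MPI:

```python
# Sketch of the timeout formula from the Open MPI help message:
#   timeout = 4.096 microseconds * (2 ^ btl_openib_ib_timeout)
# ack_timeout_us is a hypothetical helper name used only for illustration.

BASE_US = 4.096  # base interval in microseconds, as stated in the help text


def ack_timeout_us(ib_timeout: int) -> float:
    """Return the local ACK timeout in microseconds for a given exponent."""
    if ib_timeout < 0:
        raise ValueError("ib_timeout must be non-negative")
    return BASE_US * (2 ** ib_timeout)


if __name__ == "__main__":
    # 10 is the default reported in the help text; 14 and 20 are arbitrary
    # sample values chosen to show how quickly the timeout grows.
    for exponent in (10, 14, 20):
        micros = ack_timeout_us(exponent)
        print(f"btl_openib_ib_timeout={exponent}: {micros / 1000.0:.3f} ms")
```

Each increment of the parameter doubles the timeout. Raising it only
lengthens the wait for an acknowledgement before a retry, so it would not
be expected to fix a bad cable or HCA.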


On Oct 18, 2013, at 1:49 AM, "sudhirs@" <sudhirche...@gmail.com> wrote:

> Dear open-mpi user,
> I am running a CPMD calculation in parallel. I got the following error and 
> the job was killed. The error message is given below. What does this error 
> mean, and how can I fix it?
> 
> 
> [[12065,1],23][btl_openib_component.c:2948:handle_wc] from compute-0-0.local 
> to: compute-0-7 error polling LP CQ with status RETRY EXCEEDED ERROR status 
> number 12 for wr_id 396116864 opcode 0  vendor error 129 qp_idx 1
> --------------------------------------------------------------------------
> The InfiniBand retry count between two MPI processes has been
> exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
> (section 12.7.38):
> 
>     The total number of times that the sender wishes the receiver to
>     retry timeout, packet sequence, etc. errors before posting a
>     completion error.
> 
> This error typically means that there is something awry within the
> InfiniBand fabric itself.  You should note the hosts on which this
> error has occurred; it has been observed that rebooting or removing a
> particular host from the job can sometimes resolve this issue.
> 
> Two MCA parameters can be used to control Open MPI's behavior with
> respect to the retry count:
> 
> * btl_openib_ib_retry_count - The number of times the sender will
>   attempt to retry (defaulted to 7, the maximum value).
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>   to 10).  The actual timeout value used is calculated as:
> 
>      4.096 microseconds * (2^btl_openib_ib_timeout)
> 
>   See the InfiniBand spec 1.2 (section 12.7.34) for more details.
> 
> Below is some information about the host that raised the error and the
> peer to which it was connected:
> 
>   Local host:   compute-0-0.local
>   Local device: mthca0
>   Peer host:    compute-0-7
> 
> You may need to consult with your system administrator to get this
> problem fixed.
> 
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 23 with PID 24240 on
> node compute-0-0 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> forrtl: error (78): process killed (SIGTERM)
> Image              PC                Routine            Line        Source
> mca_btl_openib.so  00002AD8CFE0DED0  Unknown               Unknown  Unknown
> forrtl: error (78): process killed (SIGTERM)
> Image              PC                Routine            Line        Source
> mca_btl_sm.so      00002B316684B029  Unknown               Unknown  Unknown
> libopen-pal.so.0   00002B3162A0FD97  Unknown               Unknown  Unknown
> libmpi.so.0        00002B31625008B6  Unknown               Unknown  Unknown
> mca_coll_tuned.so  00002B3167902A3E  Unknown               Unknown  Unknown
> mca_coll_tuned.so  00002B31678FF6F5  Unknown               Unknown  Unknown
> libmpi.so.0        00002B31625178C6  Unknown               Unknown  Unknown
> libmpi_f77.so.0    00002B31622B7725  Unknown               Unknown  Unknown
> cpmd.x             0000000000808017  Unknown               Unknown  Unknown
> cpmd.x             0000000000805AF8  Unknown               Unknown  Unknown
> cpmd.x             000000000050C49D  Unknown               Unknown  Unknown
> cpmd.x             00000000005B6FC8  Unknown               Unknown  Unknown
> cpmd.x             000000000051D5DE  Unknown               Unknown  Unknown
> cpmd.x             00000000005B3557  Unknown               Unknown  Unknown
> cpmd.x             000000000095817C  Unknown               Unknown  Unknown
> cpmd.x             0000000000959557  Unknown               Unknown  Unknown
> cpmd.x             0000000000657E07  Unknown               Unknown  Unknown
> cpmd.x             000000000046F2D1  Unknown               Unknown  Unknown
> cpmd.x             000000000046EF6C  Unknown               Unknown  Unknown
> libc.so.6          0000003F34E1D974  Unknown               Unknown  Unknown
> cpmd.x             000000000046EE79  Unknown               Unknown  Unknown
> 
> 
> Thanking you
> -- 
> Sudhir Kumar Sahoo
> Ph.D Scholar
> Dept. Of Chemistry
> IIT Kanpur-208016
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
