This typically indicates a problem in the physical layer of your InfiniBand (IB) network. Run layer 0 diagnostics and look for bad cables, bad HCAs, and so on, starting with the two hosts named in the error (compute-0-0 and compute-0-7).
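For reference, the quoted error text gives the local ACK timeout as 4.096 microseconds * (2^btl_openib_ib_timeout). Here is a minimal Python sketch of that arithmetic, which may help when deciding whether to change the two MCA parameters. The function names `local_ack_timeout_us` and `mca_args` are hypothetical helpers made up for illustration, and the value 14 is an arbitrary example, not a recommendation. The only Open MPI syntax assumed is the `--mca <name> <value>` form accepted by mpirun.

```python
# Sketch only: evaluates the timeout formula printed in the Open MPI error text.
BASE_US = 4.096  # base unit from the error message, in microseconds


def local_ack_timeout_us(exponent: int) -> float:
    """Return 4.096 us * 2**exponent for a given btl_openib_ib_timeout value."""
    if exponent < 0:
        raise ValueError("btl_openib_ib_timeout must be non-negative")
    return BASE_US * (2 ** exponent)


def mca_args(timeout_exponent: int, retry_count: int) -> list[str]:
    """Build mpirun arguments for the two parameters named in the error text.

    Hypothetical helper: it only formats strings and does not call mpirun.
    """
    return [
        "--mca", "btl_openib_ib_timeout", str(timeout_exponent),
        "--mca", "btl_openib_ib_retry_count", str(retry_count),
    ]


if __name__ == "__main__":
    # 10 is the default quoted in the error text; 14 is an arbitrary example.
    for exponent in (10, 14):
        us = local_ack_timeout_us(exponent)
        print(f"btl_openib_ib_timeout={exponent}: {us:.3f} us ({us / 1000:.2f} ms)")
    print("mpirun " + " ".join(mca_args(14, 7)) + " ...")
```

With the default of 10 the formula works out to roughly 4.19 ms per timeout. A larger value only makes the sender wait longer before giving up; if a cable or HCA is bad, the retries will most likely still be exhausted.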

On Oct 18, 2013, at 1:49 AM, "sudhirs@" <sudhirche...@gmail.com> wrote:

> Dear open-mpi user,
> I am running a CPMD calculation in parallel. I got the following error and
> job got killed. Below I have given the error message. What is this error and
> how to fix it ?
>
>
> [[12065,1],23][btl_openib_component.c:2948:handle_wc] from compute-0-0.local
> to: compute-0-7 error polling LP CQ with status RETRY EXCEEDED ERROR status
> number 12 for wr_id 396116864 opcode 0 vendor error 129 qp_idx 1
> --------------------------------------------------------------------------
> The InfiniBand retry count between two MPI processes has been
> exceeded. "Retry count" is defined in the InfiniBand spec 1.2
> (section 12.7.38):
>
>     The total number of times that the sender wishes the receiver to
>     retry timeout, packet sequence, etc. errors before posting a
>     completion error.
>
> This error typically means that there is something awry within the
> InfiniBand fabric itself. You should note the hosts on which this
> error has occurred; it has been observed that rebooting or removing a
> particular host from the job can sometimes resolve this issue.
>
> Two MCA parameters can be used to control Open MPI's behavior with
> respect to the retry count:
>
> * btl_openib_ib_retry_count - The number of times the sender will
>   attempt to retry (defaulted to 7, the maximum value).
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>   to 10). The actual timeout value used is calculated as:
>
>      4.096 microseconds * (2^btl_openib_ib_timeout)
>
>   See the InfiniBand spec 1.2 (section 12.7.34) for more details.
>
> Below is some information about the host that raised the error and the
> peer to which it was connected:
>
>   Local host:   compute-0-0.local
>   Local device: mthca0
>   Peer host:    compute-0-7
>
> You may need to consult with your system administrator to get this
> problem fixed.
>
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 23 with PID 24240 on
> node compute-0-0 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> forrtl: error (78): process killed (SIGTERM)
> Image              PC                Routine  Line     Source
> mca_btl_openib.so  00002AD8CFE0DED0  Unknown  Unknown  Unknown
> forrtl: error (78): process killed (SIGTERM)
> Image              PC                Routine  Line     Source
> mca_btl_sm.so      00002B316684B029  Unknown  Unknown  Unknown
> libopen-pal.so.0   00002B3162A0FD97  Unknown  Unknown  Unknown
> libmpi.so.0        00002B31625008B6  Unknown  Unknown  Unknown
> mca_coll_tuned.so  00002B3167902A3E  Unknown  Unknown  Unknown
> mca_coll_tuned.so  00002B31678FF6F5  Unknown  Unknown  Unknown
> libmpi.so.0        00002B31625178C6  Unknown  Unknown  Unknown
> libmpi_f77.so.0    00002B31622B7725  Unknown  Unknown  Unknown
> cpmd.x             0000000000808017  Unknown  Unknown  Unknown
> cpmd.x             0000000000805AF8  Unknown  Unknown  Unknown
> cpmd.x             000000000050C49D  Unknown  Unknown  Unknown
> cpmd.x             00000000005B6FC8  Unknown  Unknown  Unknown
> cpmd.x             000000000051D5DE  Unknown  Unknown  Unknown
> cpmd.x             00000000005B3557  Unknown  Unknown  Unknown
> cpmd.x             000000000095817C  Unknown  Unknown  Unknown
> cpmd.x             0000000000959557  Unknown  Unknown  Unknown
> cpmd.x             0000000000657E07  Unknown  Unknown  Unknown
> cpmd.x             000000000046F2D1  Unknown  Unknown  Unknown
> cpmd.x             000000000046EF6C  Unknown  Unknown  Unknown
> libc.so.6          0000003F34E1D974  Unknown  Unknown  Unknown
> cpmd.x             000000000046EE79  Unknown  Unknown  Unknown
>
>
> Thanking you
> --
> Sudhir Kumar Sahoo
> Ph.D Scholar
> Dept. Of Chemistry
> IIT Kanpur-208016
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/