Re: [OMPI users] jobs are hanging with btl_openib_component error

2013-06-17 Thread Shamis, Pavel
You may use a tool such as ibdiagnet (http://linux.die.net/man/1/ibdiagnet) to debug your IB network problems. Most likely, you have a bad cable or connector somewhere in the network; the tool should be able to pinpoint the problem.

Pavel (Pasha) Shamis
---
Computer Science Research Group
Computer Scien...
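As a quick local check before running a fabric-wide tool, each node's port error counters can be read from sysfs. The sketch below assumes a Linux host that exposes InfiniBand ports under /sys/class/infiniband/<device>/ports/<port>/counters/. The counter subset in LINK_ERROR_COUNTERS and the helper nonzero_link_errors() are hypothetical choices made for illustration, not an official tool. The counters are cumulative since they were last reset, so a nonzero value shows where to look but does not prove that a link is failing right now.

```python
from pathlib import Path

# Hypothetical selection of per-port counters that tend to rise when a cable
# or connector is faulty. Which files exist can vary by driver.
LINK_ERROR_COUNTERS = (
    "symbol_error",
    "link_error_recovery",
    "link_downed",
    "port_rcv_errors",
    "local_link_integrity_errors",
)


def nonzero_link_errors(root="/sys/class/infiniband"):
    """Return {(device, port): {counter: value}} for every nonzero counter found."""
    results = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return results  # no InfiniBand devices visible on this host
    # Layout assumed: <root>/<device>/ports/<port>/counters/<counter>
    for port_dir in sorted(root_path.glob("*/ports/*")):
        device = port_dir.parent.parent.name
        port = port_dir.name
        found = {}
        for name in LINK_ERROR_COUNTERS:
            counter_file = port_dir / "counters" / name
            try:
                value = int(counter_file.read_text().strip())
            except (OSError, ValueError):
                continue  # counter missing or unreadable on this driver
            if value:
                found[name] = value
        if found:
            results[(device, port)] = found
    return results


if __name__ == "__main__":
    for (device, port), counters in nonzero_link_errors().items():
        print(f"{device} port {port}: {counters}")
```

Running this on every node and comparing the output lets you narrow down which ports to inspect. You still need a fabric-wide scan to cover switch-to-switch links, which no single host's sysfs can see.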

Re: [OMPI users] jobs are hanging with btl_openib_component error

2013-06-17 Thread Jeff Squyres (jsquyres)
That sounds like there's a problem with your InfiniBand fabric. You should run a complete level-0 diagnostic on your IB network.

On Jun 17, 2013, at 5:23 AM, "Singh, Bharati (GE Global Research, consultant)" wrote:
> Hi Team,
>
> Our users jobs are hanging and we notice below errors.
>

[OMPI users] jobs are hanging with btl_openib_component error

2013-06-17 Thread Singh, Bharati (GE Global Research, consultant)
Hi Team,

Our users' jobs are hanging, and we see the errors below:

[[61410,1],65][btl_openib_component.c:3238:handle_wc] from bng1aviationdc22 to: bng1aviationdc26 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 774739584 opcode 1 vendor error 129 qp_idx 0
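With many such lines in the job logs, counting which host pairs report the error can show whether one link or one node is involved. Status number 12 corresponds to IBV_WC_RETRY_EXC_ERR in libibverbs' enum ibv_wc_status: the sender ran out of transport retries without receiving a response from the peer. The sketch below assumes every log line follows the single handle_wc line quoted in this message; other Open MPI versions may word it differently. The regex, the partial WC_STATUS table, and tally_failing_links() are hypothetical helpers written for illustration.

```python
import re
from collections import Counter

# Pattern modeled on the one handle_wc error line quoted in this thread.
HANDLE_WC_RE = re.compile(
    r"from (?P<src>\S+) to: (?P<dst>\S+) error polling (?P<cq>\w+) CQ "
    r"with status (?P<status>.+?) status number (?P<num>\d+)"
)

# Small subset of enum ibv_wc_status from libibverbs <infiniband/verbs.h>.
WC_STATUS = {
    5: "IBV_WC_WR_FLUSH_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}


def tally_failing_links(lines):
    """Count (source host, destination host, status name) triples in log lines."""
    counts = Counter()
    for line in lines:
        m = HANDLE_WC_RE.search(line)
        if not m:
            continue  # not a handle_wc error line
        num = int(m.group("num"))
        status = WC_STATUS.get(num, f"status {num}")
        counts[(m.group("src"), m.group("dst"), status)] += 1
    return counts


if __name__ == "__main__":
    import sys

    for (src, dst, status), n in tally_failing_links(sys.stdin).most_common():
        print(f"{n:6d}  {src} -> {dst}  {status}")
```

If most reports name the same destination host, check its HCA and cable first. If the pairs vary, the common element is more likely a switch or an inter-switch link along the path.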