Hi, Vince

Have you tried with a different BTL? In particular, have you tried with the TCP BTL? Please try setting "-mca btl sm,self,tcp" and see if you still run into the issue.
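For instance, a launch line along these lines would restrict Open MPI to the shared-memory, self, and TCP BTLs so that openib is not selected. The process count, hostfile name, and executable path are placeholders I made up, not values from your setup:

```shell
# Placeholder values (assumptions, not taken from the original report)
NP=40
HOSTFILE=hosts.txt
APP=./your_scalit_executable

# Listing only sm, self, and tcp keeps the openib BTL from being used
CMD="mpirun -np $NP --hostfile $HOSTFILE --mca btl sm,self,tcp $APP"

# Print the command for review; run it with: eval "$CMD"
echo "$CMD"
```

If the job completes over TCP, that would suggest the problem is tied to the openib path.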
How is your OMPI configured?

Josh

> From: Vince Grimes <tom.gri...@ttu.edu>
> Subject: [OMPI users] LOCAL QP OPERATION ERROR
> Date: March 5, 2014 5:21:51 PM EST
> To: <us...@open-mpi.org>
> Reply-To: Open MPI Users <us...@open-mpi.org>
>
> OpenMPI folks:
>
> I am having trouble running a specific program (ScalIT, a code produced
> and maintained by one of the research groups here at TTU) using Infiniband.
> There is conflicting information that has made it impossible to diagnose the
> problem:
>
> 1) Other programs (like NWChem) run using OpenMPI over multiple nodes using
> Infiniband without any problems at all.
>
> 2) ScalIT runs on other clusters (and I believe with OpenMPI) without error.
>
> 3) ScalIT runs with OpenMPI on a single node without error.
>
> 4) ScalIT dies at a particular place with OpenMPI over multiple nodes (20)
> with OpenMPI.
>
> I don't know whether it is a hardware problem (but other codes work just
> fine) or a programming error in ScalIT (but it works without modification on
> other clusters).
>
> The error I am getting is:
> local QP operation err (QPN 0014bc, WQE @ 00009005, CQN 000097, index 2232620)
>   [ 0] 000014bc [ 4] 00000000 [ 8] 00000000 [ c] 00000000
>   [10] 026f3410 [14] 00000000 [18] 00009005 [1c] ff100000
> [[44095,1],45][btl_openib_component.c:3492:handle_wc] from
> compute-6-13.local to: compute-3-11 error polling LP CQ with status
> LOCAL QP OPERATION ERROR status number 2 for wr_id 40c5e00 opcode 0
> vendor error 111 qp_idx 0
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 45 with PID 27168 on node
> compute-6-13.local exiting improperly. There are two reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in the
> job did. This can cause a job to hang indefinitely while it waits for
> all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> I am using OpenMPI 1.6.5 compiled with the Intel 11.1-080 compilers.
>
> `uname -a` returns "Linux compute-1-1.local 2.6.32-279.14.1.el6.x86_64 #1 SMP
> Tue Nov 6 23:43:09 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux"
>
> ibv_devinfo returns
> hca_id: mthca0
>     transport: InfiniBand (0)
>     fw_ver: 1.2.0
>     node_guid: 0005:ad00:001f:fed8
>     sys_image_guid: 0005:ad00:0100:d050
>     vendor_id: 0x02c9
>     vendor_part_id: 25204
>     hw_ver: 0xA0
>     board_id: MT_03B0120002
>     phys_port_cnt: 1
>         port: 1
>             state: PORT_ACTIVE (4)
>             max_mtu: 2048 (4)
>             active_mtu: 2048 (4)
>             sm_lid: 1
>             port_lid: 39
>             port_lmc: 0x00
>             link_layer: IB
>
>
> Any help in tracking down the problem is greatly appreciated.
>
> --
> T. Vince Grimes, Ph.D.
> CCC System Administrator
>
> Texas Tech University
> Dept. of Chemistry and Biochemistry (10A) Box 41061 Lubbock, TX
> 79409-1061
>
> (806) 834-0813 (voice); (806) 742-1289 (fax)
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/