Hi, Vince

Have you tried with a different BTL? In particular, have you tried with the TCP 
BTL? Please try setting "-mca btl sm,self,tcp" and see if you still run into 
the issue.  
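
For reference, a run restricted to the TCP BTL might look roughly like the sketch below. The binary name, process count, and hostfile are placeholders I made up, not values from your report. The script only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch: build an mpirun command line that leaves out the openib BTL.
# APP, NP and HOSTFILE are hypothetical placeholders; substitute your own.
APP="${APP:-./app}"
NP="${NP:-4}"
HOSTFILE="${HOSTFILE:-hosts.txt}"

build_cmd() {
    # sm = shared memory within a node, self = loopback, tcp = between nodes
    echo "mpirun --mca btl sm,self,tcp -np $NP --hostfile $HOSTFILE $APP"
}

# Print the command by default; set RUN=1 to actually launch the job.
if [ "${RUN:-0}" = "1" ]; then
    sh -c "$(build_cmd)"
else
    build_cmd
fi
```

An alternative is "--mca btl ^openib", which excludes only the InfiniBand BTL and lets Open MPI choose among the remaining ones.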

How is your OMPI configured?
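
One way to gather part of that information is to list the BTL components your installation was built with. This is a minimal sketch, assuming ompi_info from the same Open MPI install is first in your PATH; list_btls is just a helper name for illustration.

```shell
#!/bin/sh
# list_btls: keep only the BTL component lines from ompi_info output on stdin.
list_btls() {
    grep -i 'MCA btl:'
}

# Run against the local installation if ompi_info is available.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | list_btls
else
    echo "ompi_info not found in PATH" >&2
fi
```

If openib does not appear in that list, the build has no InfiniBand support compiled in.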

Josh



> From: Vince Grimes <tom.gri...@ttu.edu>
> Subject: [OMPI users] LOCAL QP OPERATION ERROR
> Date: March 5, 2014 5:21:51 PM EST
> To: <us...@open-mpi.org>
> Reply-To: Open MPI Users <us...@open-mpi.org>
> 
> OpenMPI folks:
> 
>       I am having trouble running a specific program (ScalIT, a code produced 
> and maintained by one of the research groups here at TTU) using Infiniband. 
> There is conflicting information that has made it impossible to diagnose the 
> problem:
> 
> 1) Other programs (like NWChem) run using OpenMPI over multiple nodes using 
> Infiniband without any problems at all.
> 
> 2) ScalIT runs on other clusters (and I believe with OpenMPI) without error.
> 
> 3) ScalIT runs with OpenMPI on a single node without error.
> 
> 4) ScalIT dies at a particular place when run with OpenMPI over multiple 
> nodes (20).
> 
> I don't know whether it is a hardware problem (but other codes work just 
> fine) or a programming error in ScalIT (but it works without modification on 
> other clusters).
> 
> The error I am getting is:
> local QP operation err (QPN 0014bc, WQE @ 00009005, CQN 000097, index
> 2232620)  [ 0] 000014bc  [ 4] 00000000  [ 8] 00000000  [ c] 00000000 
> [10] 026f3410  [14] 00000000  [18] 00009005  [1c] ff100000 
> [[44095,1],45][btl_openib_component.c:3492:handle_wc] from 
> compute-6-13.local to: compute-3-11 error polling LP CQ with status 
> LOCAL QP OPERATION ERROR status number 2 for wr_id 40c5e00 opcode 0 
> vendor error 111 qp_idx 0
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 45 with PID 27168 on node 
> compute-6-13.local exiting improperly. There are two reasons this could occur:
> 
> 1. this process did not call "init" before exiting, but others in the 
> job did. This can cause a job to hang indefinitely while it waits for 
> all processes to call "init". By rule, if one process calls "init", 
> then ALL processes must call "init" prior to termination.
> 
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to 
> exiting or it will be considered an "abnormal termination"
> 
> This may have caused other processes in the application to be 
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> 
> I am using OpenMPI 1.6.5 compiled with the Intel 11.1-080 compilers.
> 
> `uname -a` returns "Linux compute-1-1.local 2.6.32-279.14.1.el6.x86_64 #1 SMP 
> Tue Nov 6 23:43:09 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux"
> 
> ibv_devinfo returns
> hca_id: mthca0
>        transport:                      InfiniBand (0)
>        fw_ver:                         1.2.0
>        node_guid:                      0005:ad00:001f:fed8
>        sys_image_guid:                 0005:ad00:0100:d050
>        vendor_id:                      0x02c9
>        vendor_part_id:                 25204
>        hw_ver:                         0xA0
>        board_id:                       MT_03B0120002
>        phys_port_cnt:                  1
>                port:   1
>                        state:                  PORT_ACTIVE (4)
>                        max_mtu:                2048 (4)
>                        active_mtu:             2048 (4)
>                        sm_lid:                 1
>                        port_lid:               39
>                        port_lmc:               0x00
>                        link_layer:             IB
> 
> 
> Any help in tracking down the problem is greatly appreciated.
> 
> --
> T. Vince Grimes, Ph.D.
> CCC System Administrator
> 
> Texas Tech University
> Dept. of Chemistry and Biochemistry (10A) Box 41061 Lubbock, TX
> 79409-1061
> 
> (806) 834-0813 (voice);     (806) 742-1289 (fax)
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
