On Apr 29, 2014, at 4:28 PM, Vince Grimes <tom.gri...@ttu.edu> wrote:

> I realize it is no longer in the history of replies for this message, but the 
> reason I am trying to use TCP instead of InfiniBand is because:
> 
> We are using an in-house program called ScalIT that performs operations on 
> very large sparse distributed matrices.
> ScalIT works on other clusters with comparable hardware and software, but not 
> ours.
> Other programs run just fine on our cluster using Open MPI.
> ScalIT runs to completion using Open MPI *on a single 12-core node*.
> 
> It was suggested to me by another list member that I try forcing the use of 
> TCP instead of InfiniBand, so that's what I've been trying, just to see if it 
> will work. I guess the TCP code is expected to be more reliable?

No, but it *should* be easier to configure...

We have previously seen instability of the IP-over-IB drivers, but I haven't 
been directly involved in the IB community for years, so that information may 
well be dated.

> The mca parameters used to produce the current error are: "--mca btl 
> self,sm,tcp --mca btl_tcp_if_exclude lo,ib0"
> 
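FWIW, those parameters look right for forcing everything over TCP. For
reference, a full invocation would look something like the following (the
process count and binary name below are placeholders for your actual job):

    mpirun -np 144 --mca btl self,sm,tcp --mca btl_tcp_if_exclude lo,ib0 ./scalit

Alternatively, if you know which Ethernet interface you want the traffic on,
you can explicitly include it instead (assuming that interface is eth0 on
your nodes):

    mpirun -np 144 --mca btl self,sm,tcp --mca btl_tcp_if_include eth0 ./scalit

Note that btl_tcp_if_include and btl_tcp_if_exclude are mutually exclusive;
set only one of them.
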
> The previous InfiniBand error message is:
> ---
> local QP operation err (QPN 7c1d43, WQE @ 00015005, CQN 7a009a, index 307512)
>  [ 0] 007c1d43
>  [ 4] 00000000
>  [ 8] 00000000
>  [ c] 00000000
>  [10] 026b2ed0
>  [14] 00000000
>  [18] 00015005
>  [1c] ff100000
> [[31552,1],84][btl_openib_component.c:3492:handle_wc] from compute-4-5.local 
> to: compute-4-13 error polling LP CQ with status LOCAL QP OPERATION ERROR 
> status number 2 for wr_id 246f300 opcode 128  vendor error 107 qp_idx 0
> ---
> 
> It was also suggested that I disable eager RDMA. Doing this ("--mca 
> btl_openib_use_eager_rdma 0") results in:
> ---
> [[30430,1],234][btl_openib_component.c:3492:handle_wc] from 
> compute-1-18.local to: compute-6-10 error polling HP CQ with status WORK 
> REQUEST FLUSHED ERROR status number 5 for wr_id 2c41e80 opcode 128 vendor 
> error 244 qp_idx 0
> ---
> 
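Note that WORK REQUEST FLUSHED ERROR (status 5) is usually a secondary
symptom: once a QP transitions into the error state, all of its outstanding
work requests get flushed. So this is quite possibly the same underlying
failure as before, just surfacing differently. For reference, the full
invocation with eager RDMA disabled would be something like this (process
count and binary name are again placeholders):

    mpirun -np 144 --mca btl self,sm,openib --mca btl_openib_use_eager_rdma 0 ./scalit
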
> All the InfiniBand errors come in the same place with respect to the program 
> output and reference the same Open MPI code line. (It is notoriously difficult 
> to trace through this program and pin down exactly where the error occurs, as 
> ScalIT is written in appalling FORTRAN.)

Do you know for sure that this is a correct MPI application?

The errors you describe above may well be due to IB layer-0 kinds of errors 
(e.g., bad cables and/or bad HCAs), or they could be due to application errors 
(e.g., memory corruption).
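
If you want to rule out the hardware, the standard InfiniBand diagnostic
tools are a reasonable first pass; for example, run on the affected nodes
(or a host that can see the fabric):

    # show local HCA port state, link width, and speed
    ibstat
    # scan the fabric for bad links and per-port error counters
    ibdiagnet
    # report ports whose error counters exceed thresholds
    ibqueryerrors

Climbing symbol/link error counters on a particular port are a good hint
that a cable or HCA is at fault.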

I say this because if you're getting hangs with TCP and errors with IB, it could 
be that the application itself is faulty...
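
One relatively cheap way to check: run the application under a memory
checker. Valgrind is slow, but it works fine underneath mpirun; e.g. (the
binary name is a placeholder, and running over TCP keeps the openib BTL out
of the picture):

    mpirun -np 12 --mca btl self,sm,tcp \
        valgrind --error-exitcode=1 --track-origins=yes ./scalit

If valgrind flags invalid reads/writes before the failure point, that's a
strong sign the problem is in the application rather than the interconnect.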

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
