On Apr 29, 2014, at 4:28 PM, Vince Grimes <tom.gri...@ttu.edu> wrote:
> I realize it is no longer in the history of replies for this message, but the
> reason I am trying to use tcp instead of Infiniband is because:
>
> We are using an in-house program called ScalIT that performs operations on
> very large sparse distributed matrices.
> ScalIT works on other clusters with comparable hardware and software, but not
> ours.
> Other programs run just fine on our cluster using OpenMPI.
> ScalIT runs to completion using OpenMPI *on a single 12-core node*.
>
> It was suggested to me by another list member that I try forcing usage of tcp
> instead of Infiniband, so that's what I've been trying, just to see if it
> will work. I guess the tcp code is expected to be more reliable?

No, but it *should* be easier to configure...  We have previously seen
instability of the IP-over-IB drivers, but I haven't been directly involved
in the IB community for years, so that information may well be dated.

> The mca parameters used to produce the current error are:
> "--mca btl self,sm,tcp --mca btl_tcp_if_exclude lo,ib0"
>
> The previous Infiniband error message is:
> ---
> local QP operation err (QPN 7c1d43, WQE @ 00015005, CQN 7a009a, index 307512)
>   [ 0] 007c1d43
>   [ 4] 00000000
>   [ 8] 00000000
>   [ c] 00000000
>   [10] 026b2ed0
>   [14] 00000000
>   [18] 00015005
>   [1c] ff100000
> [[31552,1],84][btl_openib_component.c:3492:handle_wc] from compute-4-5.local
> to: compute-4-13 error polling LP CQ with status LOCAL QP OPERATION ERROR
> status number 2 for wr_id 246f300 opcode 128 vendor error 107 qp_idx 0
> ---
>
> It was also suggested that I disable eager RDMA. Doing this
> ("--mca btl_openib_use_eager_rdma 0") results in:
> ---
> [[30430,1],234][btl_openib_component.c:3492:handle_wc] from
> compute-1-18.local to: compute-6-10 error polling HP CQ with status WORK
> REQUEST FLUSHED ERROR status number 5 for wr_id 2c41e80 opcode 128 vendor
> error 244 qp_idx 0
> ---
>
> All the Infiniband errors come in the same place with respect to the program
> output and reference the same OpenMPI code line. (It is notoriously difficult
> to trace through this program to be sure of the location in the code where
> the error occurs as ScalIT is written in appalling FORTRAN.)

Do you know for sure that this is a correct MPI application?

The errors you describe above may well be due to IB layer-0 kinds of errors
(e.g., bad cables and/or bad HCAs), or they could be due to application errors
(e.g., memory corruption).  I say this because if you're getting hangs in TCP
and errors with IB, it could be that the application itself is faulty...

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
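
P.S. Just so we're comparing the same thing: the full launch commands I'm
*assuming* you're running look roughly like the two below.  The hostfile name,
process count, and executable name (myhosts, 240, scalit.x) are placeholders;
substitute whatever you actually use.

  # TCP-only run: force the tcp BTL (plus sm/self), skip loopback and IPoIB
  mpirun --hostfile myhosts -np 240 \
      --mca btl self,sm,tcp \
      --mca btl_tcp_if_exclude lo,ib0 \
      ./scalit.x

  # IB run with eager RDMA disabled on the openib BTL
  mpirun --hostfile myhosts -np 240 \
      --mca btl self,sm,openib \
      --mca btl_openib_use_eager_rdma 0 \
      ./scalit.x

If your actual command lines differ from that, please send them along.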