Open MPI folks:

I have previously mentioned a problem with an in-house code (ScalIT) that generates the error message

[[31552,1],84][btl_openib_component.c:3492:handle_wc] from compute-4-5.local to: compute-4-13 error polling LP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 246f300 opcode 128 vendor error 107 qp_idx 0

at a specific, reproducible point. It was suggested that the error could be due to memory problems, such as the amount of registered memory. I have already corrected the amount of registered memory per the URLs that were given to me. My question today is twofold:

First, is it possible that ScalIT uses so much memory that none is left to register for IB communication? ScalIT is very memory-intensive and must run distributed just to fit a large matrix in memory (split across nodes).

Second, is there a way to trap that error so that I can see the call stack, showing which MPI function was called and exactly where in the code the error was generated?

--
T. Vince Grimes, Ph.D.
CCC System Administrator

Texas Tech University
Dept. of Chemistry and Biochemistry (10A)
Box 41061
Lubbock, TX 79409-1061

(806) 834-0813 (voice);     (806) 742-1289 (fax)
