OpenMPI folks:
I have mentioned before a problem with an in-house code (ScalIT) that
generates the error message
[[31552,1],84][btl_openib_component.c:3492:handle_wc] from
compute-4-5.local to: compute-4-13 error polling LP CQ with status LOCAL
QP OPERATION ERROR status number 2 for wr_id 246f300 opcode 128 vendor
error 107 qp_idx 0
at a specific, reproducible point. It was suggested that the error could
be due to memory problems, such as the amount of registered memory. I
have already increased the registered-memory limits per the URLs that
were given to me. My question today is twofold:
First, is it possible that ScalIT uses so much memory that there is none
left to register for IB communications? ScalIT is very memory-intensive
and has to run distributed just to fit a large matrix in memory (split
across nodes).
Second, is there a way to trap that error so I can see the call stack,
showing which MPI function was called and exactly where in the code the
error was generated?
--
T. Vince Grimes, Ph.D.
CCC System Administrator
Texas Tech University
Dept. of Chemistry and Biochemistry (10A)
Box 41061
Lubbock, TX 79409-1061
(806) 834-0813 (voice); (806) 742-1289 (fax)