Jeff Squyres <jsquyres <at> cisco.com> writes:
>
> On Oct 31, 2007, at 9:52 PM, Neeraj Chourasia wrote:
>
> > but the program is running on TCP interconnect with same
> > datasize and also on IB with small datasize say 1MB. So I don't
> > think problem is in OpenMPI, it has to do something with IB logic,
> > which probably doesn't work well with threads.
>
> Open MPI's TCP nominally supports threads, but I'd be surprised if it
> works consistently (i.e., it has not been tested thoroughly). The
> Open MPI IB code definitely does not yet work with threads.
>
> > I also tried the program with MPI_THREAD_SERIALIZED, but in vain.
>
> Open MPI currently treats this as no different than THREAD_SINGLE;
> the problem is that you'll still have multiple different threads
> calling MPI simultaneously with your program.
>
> > When is the version 1.3 scheduled to be released? Would it fix
> > such issues?
>
> No. We had been planning to make THREAD_MULTIPLE support available
> in the 1.3 series, but there honestly has not been enough customer
> demand for it such that we could not justify putting the resources /
> spending the time to finish it in Open MPI. THREAD_MULTIPLE is
> still on the long-term roadmap, but it will not be included in the
> 1.4 series.
>

This is an old thread, and I'm curious whether this is supported now.

I have a large hybrid MPI/OpenMP code that is having trouble over our
InfiniBand network. I'm running a fairly large problem (it uses about
18GB), and partway through the run I get the following errors:

[[929,1],0][btl_openib_component.c:3238:handle_wc] from tebow to: tebow416 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 103761776 opcode 128 vendor error 105 qp_idx 3
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 29873 on node tebow exiting
without calling "finalize". This may have caused other processes in the
application to be terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

This seems very similar to the question that originated this thread. Since
we're now on version 1.4.5, I was wondering whether there is any better help
for this (compiler options, run-time flags, or anything else), or whether
someone has encountered this problem and solved it.

Thanks,
Jack