Jeff Squyres <jsquyres <at> cisco.com> writes:

> 
> On Oct 31, 2007, at 9:52 PM, Neeraj Chourasia wrote:
> 
> >     but the program runs on the TCP interconnect with the same
> > data size, and also on IB with a small data size, say 1MB. So I don't
> > think the problem is in Open MPI; it has something to do with the IB
> > logic, which probably doesn't work well with threads.
> 
> Open MPI's TCP nominally supports threads, but I'd be surprised if it  
> works consistently (i.e., it has not been tested thoroughly).  The  
> Open MPI IB code definitely does not yet work with threads.
> 
> > I also tried the program with MPI_THREAD_SERIALIZED, but in vain.
> 
> Open MPI currently treats this as no different than THREAD_SINGLE;
> the problem is that, with your program, you'll still have multiple
> threads calling MPI simultaneously.
> 
> >  When is the version 1.3 scheduled to be released? Would it fix  
> > such issues?
> 
> No.  We had been planning to make THREAD_MULTIPLE support available
> in the 1.3 series, but there honestly has not been enough customer
> demand to justify spending the time and resources to finish it in
> Open MPI.  THREAD_MULTIPLE is still on the long-term roadmap, but it
> will not be included in the 1.4 series.
> 

This is an old thread, and I'm curious whether there is support for this now.
I have a large hybrid MPI/OpenMP code that is having trouble over our
InfiniBand network.  I'm running a fairly large problem (it uses about 18GB),
and partway in, I get the following errors:


[[929,1],0][btl_openib_component.c:3238:handle_wc] from tebow to: tebow416 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 103761776 opcode 128  vendor error 105 qp_idx 3
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 29873 on
node tebow exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------


This seems very similar to the question that originated this thread.  Since
we're now on version 1.4.5, I was wondering whether there is any better help
for this (compiler options, run-time flags, or anything else), or whether
someone has encountered this problem and solved it.

Thanks,
Jack
