What Sam is alluding to is that the OpenFabrics driver code in OMPI is sucking 
up oodles of memory for each IB connection that you're using.  The 
receive_queues param that he sent tells OMPI to use all shared receive queues 
(instead of defaulting to one per-peer receive queue and the rest shared 
receive queues -- the per-peer RQ sucks up all the memory when you multiply it 
by N peers).
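
To make the "multiply it by N peers" arithmetic concrete, here is a rough 
sketch in Python. It is a back-of-the-envelope estimate, not what OMPI 
actually computes: it assumes each queue entry starts with 
TYPE,buffer_size,num_buffers, charges buffer_size * num_buffers per queue, 
counts a shared queue once per process, and ignores headers and per-QP 
overhead. The function name estimate_rq_bytes and the mixed spec below are 
made up for illustration; only the all-shared spec comes from Sam's mail.

```python
# Rough estimate of receive-buffer memory implied by a receive_queues spec.
# Assumption: cost per queue is buffer_size * num_buffers; headers and
# per-QP overhead are ignored, so treat the numbers as a lower bound.

def estimate_rq_bytes(spec, n_peers):
    """Return (per_peer_bytes, shared_bytes) for one process."""
    per_peer = 0
    shared = 0
    for queue in spec.split(":"):
        fields = queue.split(",")
        kind = fields[0]
        size, count = int(fields[1]), int(fields[2])
        if kind == "P":
            # A per-peer queue is replicated for every connected peer.
            per_peer += size * count * n_peers
        elif kind == "S":
            # A shared receive queue is counted once, whatever N is.
            shared += size * count
        else:
            raise ValueError("unhandled queue type: " + kind)
    return per_peer, shared


# Hypothetical spec with one per-peer queue, for comparison only.
mixed = "P,12288,128,64,32:S,65536,128,64,32"
# The all-shared spec suggested earlier in this thread.
all_shared = "S,12288,128,64,32:S,65536,128,64,32"

n_peers = 4 * 48 - 1  # every other rank in a 4 x 48 core job

for name, spec in (("mixed", mixed), ("all shared", all_shared)):
    per_peer, shared = estimate_rq_bytes(spec, n_peers)
    print(name, "per-peer bytes:", per_peer, "shared bytes:", shared)
```

The per-peer figure grows linearly with the number of ranks, while the 
shared figure stays flat, which is why the all-shared setting scales better.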


On May 19, 2011, at 11:59 AM, Samuel K. Gutierrez wrote:

> Hi,
> 
> On May 19, 2011, at 9:37 AM, Robert Horton wrote:
> 
>> On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote:
>>> Hi,
>>> 
>>> Try the following QP parameters that only use shared receive queues.
>>> 
>>> -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32
>>> 
>> 
>> Thanks for that. If I run the job over 2 x 48 cores it now works and the
>> performance seems reasonable (I need to do some more tuning) but when I
>> go up to 4 x 48 cores I'm getting the same problem:
>> 
>> [compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
>>  error creating qp errno says Cannot allocate memory
>> [compute-1-7.local:18106] *** An error occurred in MPI_Isend
>> [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD
>> [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list
>> [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now 
>> abort)
>> 
>> Any thoughts?
> 
> How much memory does each node have?  Does this happen at startup?
> 
> Try adding:
> 
> -mca btl_openib_cpc_include rdmacm
> 
> I'm not sure if your version of OFED supports this feature, but maybe using 
> XRC may help.  I **think** other tweaks are needed to get this going, but I'm 
> not familiar with the details.
> 
> Hope that helps,
> 
> Samuel K. Gutierrez
> Los Alamos National Laboratory
> 
> 
>> 
>> Thanks,
>> Rob
>> -- 
>> Robert Horton
>> System Administrator (Research Support) - School of Mathematical Sciences
>> Queen Mary, University of London
>> r.hor...@qmul.ac.uk  -  +44 (0) 20 7882 7345
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

