What Sam is alluding to is that the OpenFabrics driver code in OMPI is sucking up oodles of memory for each IB connection that you're using. The receive_queues param that he sent tells OMPI to use all shared receive queues (instead of defaulting to one per-peer receive queue plus shared receive queues for the rest -- it's the per-peer RQ that sucks up all the memory once you multiply it by N peers).
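For reference, a minimal sketch of what a full mpirun invocation might look like with that all-shared-receive-queue setting (the process count, BTL list, and application name below are placeholders, not taken from this thread):

    mpirun -np 192 \
        --mca btl openib,self,sm \
        --mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32 \
        ./my_app

Each colon-separated "S,..." entry asks for one shared receive queue (the leading number after S is the buffer size in bytes), so with only S entries no per-peer receive queues get allocated at all.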
On May 19, 2011, at 11:59 AM, Samuel K. Gutierrez wrote:

> Hi,
>
> On May 19, 2011, at 9:37 AM, Robert Horton wrote:
>
>> On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote:
>>> Hi,
>>>
>>> Try the following QP parameters that only use shared receive queues.
>>>
>>> -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32
>>>
>>
>> Thanks for that. If I run the job over 2 x 48 cores it now works and the
>> performance seems reasonable (I need to do some more tuning) but when I
>> go up to 4 x 48 cores I'm getting the same problem:
>>
>> [compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
>> error creating qp errno says Cannot allocate memory
>> [compute-1-7.local:18106] *** An error occurred in MPI_Isend
>> [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD
>> [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list
>> [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>>
>> Any thoughts?
>
> How much memory does each node have? Does this happen at startup?
>
> Try adding:
>
> -mca btl_openib_cpc_include rdmacm
>
> I'm not sure if your version of OFED supports this feature, but maybe using
> XRC may help. I **think** other tweaks are needed to get this going, but I'm
> not familiar with the details.
>
> Hope that helps,
>
> Samuel K. Gutierrez
> Los Alamos National Laboratory
>
>
>> Thanks,
>> Rob
>> --
>> Robert Horton
>> System Administrator (Research Support) - School of Mathematical Sciences
>> Queen Mary, University of London
>> r.hor...@qmul.ac.uk - +44 (0) 20 7882 7345

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/