On Jun 6, 2007, at 5:44 PM, Michael Edwards wrote:
I am running open-mpi 1.1.1-1 compiled from OFED 1.1, which I downloaded
from their website.
You might want to upgrade your Open MPI installation; the current
stable version is 1.2.2 (1.2.3 is coming shortly, fixing a few minor
regressions that crept into 1.2.2). You can upgrade OMPI
independently of OFED. Use the "--with-openib=/usr/local/ofed" option
to OMPI's configure to pick up the OFED 1.1 installation (or, if you
used a different OFED prefix, use that as the value for the
--with-openib flag).
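As a rough sketch of what that configure step might look like: the
helper names below and the /opt/openmpi-1.2.2 install prefix are
assumptions for illustration only; the --with-openib value is the one
mentioned above, and --prefix is the usual configure install option.

```shell
# Print the configure arguments for building Open MPI against OFED.
# $1: OFED install prefix (defaults to /usr/local/ofed, as above)
# $2: Open MPI install prefix (hypothetical default; adjust as needed)
ompi_configure_args() {
    ofed_prefix="${1:-/usr/local/ofed}"
    ompi_prefix="${2:-/opt/openmpi-1.2.2}"
    printf '%s\n' "--prefix=$ompi_prefix --with-openib=$ofed_prefix"
}

# Hypothetical wrapper: run it from inside the unpacked Open MPI source tree.
build_ompi() {
    # Word splitting of the argument string is intentional here.
    ./configure $(ompi_configure_args "$@") && make all install
}
```

If OFED lives somewhere else, pass that prefix as the first argument,
e.g. "build_ompi /opt/ofed".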
I am using SGE installed via OSCAR 5.0 and when running under SGE I
get the "mca_mpool_openib_register: ibv_reg_mr(0x590000,528384) failed
with error: Cannot allocate memory" error discussed at length in your
FAQ.
When I run from the command line using mpirun, I don't get the errors.
Of course, I don't know how to tell if the code is actually using the
IB interface instead of the GigE network...
You can tell in two ways:
1. You can force the IB network to be used:
    mpirun --mca btl openib,self ...
Alternatively, you can force the use of the gigE network:
    mpirun --mca btl tcp,self ...
2. The bandwidth and latency reported by any benchmark application
should be obviously far better over IB than over the gigE network.
For example, running NetPIPE (http://www.scl.ameslab.gov/netpipe/):
    mpirun -np 2 NPmpi
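A related sanity check, sketched here under the assumption that
ompi_info (installed alongside mpirun) lists each built component on a
line of the form "MCA btl: openib (...)", is to confirm that the
openib BTL was built at all. The function name is made up for
illustration.

```shell
# Succeeds if ompi_info-style text on stdin lists the openib BTL component.
has_openib_btl() {
    grep -q 'MCA btl: openib'
}

# Typical use on a compute node (requires Open MPI in PATH):
#   ompi_info | has_openib_btl && echo "openib BTL is available"
#
# To compare the two networks, run the same benchmark both ways:
#   mpirun --mca btl openib,self -np 2 NPmpi
#   mpirun --mca btl tcp,self -np 2 NPmpi
```

If the openib component is missing from the ompi_info output, Open MPI
was probably built without OFED support, and there is nothing to
compare against the gigE numbers.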
I tried the suggestions in the FAQ regarding setting the memlock
parameter in /etc/security/limits.conf, and all the nodes return
"unlimited" in response to "ulimit -l" after rebooting the nodes. The
problem persists under SGE and still does not appear when simply using
mpirun.
The problem is that the SGE daemons are not starting with these
locked-memory limits. Therefore, processes that start under SGE
inherit the low locked-memory limits, and things go badly from there.
I'm afraid I'm not familiar enough with SGE to know how to fix this.
One Big Thing to check is that when the SGE daemons are started at
init.d/boot time, they have the proper "unlimited" memory locked
limits. Then processes that start under SGE should inherit the
"unlimited" value and be ok. That being said, SGE may also
specifically override the memory locked limits (some resource
managers can do this based on site-wide policies). Check to see if
SGE is doing this.
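One way to see what limit SGE jobs actually inherit is to print it
from inside a job. The following is only a sketch: the function name
is made up, and it relies on nothing more than the shell's own
"ulimit -l".

```shell
# Classify a locked-memory limit as reported by "ulimit -l".
# Prints "ok" for "unlimited", otherwise "low: <value>".
memlock_status() {
    if [ "$1" = "unlimited" ]; then
        echo "ok"
    else
        echo "low: $1"
    fi
}

# Inside a job script, report the limit this process inherited.
echo "$(hostname): memlock $(memlock_status "$(ulimit -l)")"
```

Saved as, say, check_memlock.sh (a hypothetical name) and submitted
with something like "qsub -cwd -j y -o memlock.out check_memlock.sh",
the output file shows the limit on the execution host. If it reports a
low value there while an interactive login reports "unlimited", the
SGE execution daemon was most likely started with the low limit. One
possible remedy, assuming the daemon is launched from an init script,
is to add "ulimit -l unlimited" near the top of that script and
restart the daemon.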
I assumed it would work since openmpi 1.1.1 was included as working
with SGE in OSCAR 5.0, but I don't know how different that version and
the one included with OFED are.
Any suggestions would be appreciated.
--
Jeff Squyres
Cisco Systems