Brock and I talked about this on IM -- the preferred solution would be
to set the cluster nodes' limits.conf to allow interactive logins to
have unlimited locked memory. That would fix the OFED issue.
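
As a rough sketch of what that might look like (assuming the nodes use
pam_limits with the usual /etc/security/limits.conf path, and that the
limits are applied to the login path in use, e.g. sshd; the wildcard
domain and the check_memlock helper name are illustrative, not taken from
Brock's setup), the two limits.conf lines are shown as comments, followed
by a small check to run from a fresh login on a compute node:

```shell
#!/bin/bash
# Assumed additions to /etc/security/limits.conf on each compute node:
#   * soft memlock unlimited
#   * hard memlock unlimited

# Hypothetical helper: report the locked-memory limit of the current shell.
check_memlock() {
    local limit
    limit=$(ulimit -l)   # soft limit on locked memory, in KB or "unlimited"
    if [ "$limit" = "unlimited" ]; then
        echo "memlock ok: unlimited"
    else
        echo "memlock limited to ${limit} KB"
    fi
}

check_memlock
```

If the check still reports a finite value after the change, the limits
are probably not being applied to that login path.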
On Aug 27, 2009, at 11:01 AM, Brock Palen wrote:
We have a case where we need to spawn many (randomly) allocated MPI jobs
within the same PBS job. (I have talked to the user about changing
this behavior).
The code will work if I do:
pbsdsh -n $(($GROUP*$JOBSIZE-$JOBSIZE)) \
    mpirun \
    -wdir $PWD/$GROUP \
    --mca plm ^tm \
    --mca ras ^tm \
    --hostfile $PWD/nodefile.$GROUP \
    ./swjv_aim &
The problem is that only the pbs_mom on our system starts with the
correct amount of pinned/locked memory for OFED, so not using the tm ras
causes OFED to fail on us.
I tried removing --mca plm ^tm, which I would think would make mpirun
use tm to launch processes while still reading hosts from the nodefile
(which is built dynamically in the PBS script from PBS_NODEFILE). When
run that way, though, mpirun fails with:
[nyx0407.engin.umich.edu:07392] plm:tm: failed to poll for a spawned
daemon, return status = 17002
In the pbs_mom logs I see this error:
08/27/2009 10:53:57;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file
descriptor (9) in tm_request, bad header Negative sign on an unsigned
datum
Is there a way to tell Open MPI to start on only these hosts from the
PBS job, and to launch using tm?
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985
--
Jeff Squyres
jsquy...@cisco.com