We have a case where we need to spawn many (randomly) allocated MPI jobs
within the same PBS job. (I have talked to the user about changing
this behavior.)
The code works if I do:
pbsdsh -n $(($GROUP*$JOBSIZE-$JOBSIZE)) \
mpirun \
-wdir $PWD/$GROUP \
--mca plm ^tm \
--mca ras ^tm \
--hostfile $PWD/nodefile.$GROUP \
./swjv_aim &
The problem is that only the pbs_mom on our system starts with the
correct amount of pinned/locked memory for OFED, so not using the tm ras
causes OFED to fail on us.
I tried removing --mca plm ^tm, which I would think would make mpirun
use tm to launch processes while still reading the hosts from the
nodefile (which is built dynamically in the PBS script from
PBS_NODEFILE). When run that way, though, mpirun fails with:
[nyx0407.engin.umich.edu:07392] plm:tm: failed to poll for a spawned daemon, return status = 17002
In the pbs_mom logs I see this error:
08/27/2009 10:53:57;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
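For context, here is a minimal sketch of how per-group nodefiles like
nodefile.$GROUP could be cut out of PBS_NODEFILE. The helper name
split_nodefile is hypothetical and this is not the exact script in use;
it assumes PBS_NODEFILE lists one slot per line and that every group gets
JOBSIZE consecutive lines. The offset is the same arithmetic as the
$(($GROUP*$JOBSIZE-$JOBSIZE)) passed to pbsdsh -n above.

```shell
#!/bin/sh
# Hypothetical helper (not from the real PBS script): split a PBS-style
# nodefile into per-group files named nodefile.1, nodefile.2, ...
# Arguments: nodefile path, slots per group (JOBSIZE), output directory.
split_nodefile() {
    nodefile=$1
    jobsize=$2
    outdir=$3
    total=$(wc -l < "$nodefile")
    # Only complete groups are written; leftover slots are ignored.
    ngroups=$((total / jobsize))
    group=1
    while [ "$group" -le "$ngroups" ]; do
        # Zero-based offset of this group's first slot.
        offset=$((group * jobsize - jobsize))
        # Lines offset+1 through offset+jobsize belong to this group.
        sed -n "$((offset + 1)),$((offset + jobsize))p" "$nodefile" \
            > "$outdir/nodefile.$group"
        group=$((group + 1))
    done
}

# Inside a PBS job this would be called roughly as follows; the guard
# keeps the sketch harmless when run outside of PBS.
if [ -n "${PBS_NODEFILE:-}" ] && [ -n "${JOBSIZE:-}" ]; then
    split_nodefile "$PBS_NODEFILE" "$JOBSIZE" "$PWD"
fi
```

With a six-line nodefile and JOBSIZE=2 this would write nodefile.1 through
nodefile.3, two lines each.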
Is there a way to tell Open MPI to start on only these hosts from the
PBS job, and to launch on them using tm?
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985