We have a case where we need to spawn many randomly allocated MPI jobs within the same PBS job. (I have talked to the user about changing this behavior.)

The code works if I do:

   pbsdsh -n $(($GROUP*$JOBSIZE-$JOBSIZE)) \
             mpirun   \
             -wdir $PWD/$GROUP   \
             --mca plm ^tm \
             --mca ras ^tm \
             --hostfile $PWD/nodefile.$GROUP \
             ./swjv_aim &

The problem is that on our system only pbs_mom starts with the correct amount of pinned/locked memory for OFED, so not using the tm ras causes OFED to fail on us.

I tried removing --mca plm ^tm, which I thought would make mpirun use tm to launch processes while still reading the hosts from the nodefile (which is built dynamically in the PBS script from PBS_NODEFILE). When run that way, though, mpirun fails with:

[nyx0407.engin.umich.edu:07392] plm:tm: failed to poll for a spawned daemon, return status = 17002

In the pbs_mom logs I see this error:
08/27/2009 10:53:57;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file descriptor (9) in tm_request, bad header Negative sign on an unsigned datum
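
For context, here is a minimal sketch of how a per-group nodefile like nodefile.$GROUP could be cut out of PBS_NODEFILE. It is an illustration, not the actual script: it assumes each group takes a contiguous block of JOBSIZE lines and that GROUP counts from 1, which matches the pbsdsh -n offset used above. The function name split_nodefile and the variable NGROUPS are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: copy the block of lines belonging to one group
# from a master nodefile into a per-group nodefile.
#   $1 = group number (1-indexed, assumed)
#   $2 = number of slots per group (JOBSIZE)
#   $3 = master nodefile (e.g. $PBS_NODEFILE)
#   $4 = output file (e.g. $PWD/nodefile.$GROUP)
split_nodefile() {
    group=$1
    jobsize=$2
    src=$3
    dest=$4
    # Group G owns lines (G-1)*JOBSIZE+1 through G*JOBSIZE of the master file.
    first=$(( (group - 1) * jobsize + 1 ))
    last=$(( group * jobsize ))
    sed -n "${first},${last}p" "$src" > "$dest"
}

# Possible use inside the PBS script (NGROUPS is an assumed variable):
# GROUP=1
# while [ "$GROUP" -le "$NGROUPS" ]; do
#     split_nodefile "$GROUP" "$JOBSIZE" "$PBS_NODEFILE" "$PWD/nodefile.$GROUP"
#     GROUP=$((GROUP + 1))
# done
```

With this layout, the first line of nodefile.$GROUP is the same slot that pbsdsh -n $(($GROUP*$JOBSIZE-$JOBSIZE)) selects, since pbsdsh counts slots from 0.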

Is there a way to tell Open MPI to start on only these hosts from the PBS job, and to launch using tm?

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985


