I am having two problems with the integration of Open MPI 1.3 and SGE 6.2u1, 
both of which are new to us.  The problems are getting jobs to suspend/resume 
and collecting CPU time correctly.

For suspend/resume I have added the following to my mpirun command:

--mca orte_forward_job_control 1 --mca plm_rsh_daemonize_qrsh 1

and adjusted the suspend_method for the queue that the job is running in.  So 
far I have not gotten it to place any process into the T (stopped) state.  
Although this is not a huge problem, I hope to have it working in the future.
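
One way to check whether a suspend actually reached a process is to read its 
state directly on the compute node. The following is only a minimal sketch: it 
assumes Linux nodes where /proc/<pid>/stat is readable, and the function names 
are hypothetical helpers, not part of Open MPI or SGE.

```python
from pathlib import Path


def parse_state(stat_line: str) -> str:
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST ')' before reading fields.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def process_state(pid: int) -> str:
    """Read the current state letter of a process (Linux /proc assumed)."""
    return parse_state(Path(f"/proc/{pid}/stat").read_text())


def is_stopped(pid: int) -> bool:
    """True if the process is in the T (stopped by job control) state."""
    return process_state(pid) == "T"
```

Calling is_stopped() on the PIDs of the MPI ranks after the queue is suspended 
would show whether the signal was forwarded to them.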

My main problem is getting the CPU time reported correctly.  On a multi-CPU 
job, only the master node shows the correct CPU time for its process; the 
values for the others are very short, and I am not sure what they are 
measuring (I believe startup time).  Here's an example:

cpu          0.360
cpu          0.480
cpu          0.470
cpu          0.490
cpu          0.530
cpu          0.470
cpu          0.680
cpu          464.305

From watching the runs, that time is close to the wall clock time and matches 
what I see for that single process.  I have gotten it to report what I believe 
are correct values, but only by adding the --debug-daemons option to our 
mpirun command.  With that I get the following:

cpu          73.146
cpu          72.982
cpu          73.381
cpu          73.142
cpu          73.029
cpu          73.183
cpu          73.117
cpu          73.265
cpu          73.236

I have noticed that when the CPU time is reported correctly, the qrsh 
processes that start up (my understanding is that these are what start the 
processes on the remote machines) stay running until the job is finished.  
When the CPU time is not correct, I see the qrsh processes start on the master 
node, but they die off once they have started the processes on the remote 
nodes.  The PE environment looks like the following:


pe_name            orte
slots              560
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

Please let me know if I can provide any more information to help figure this 
out.

Thanks,

Scott Malone
Manager, High Performance Computing Facility
Information Sciences - Research Informatics
St. Jude Children's Research Hospital
332 North Lauderdale
Memphis, TN 38105
901.495.4947
scott.mal...@stjude.org


