Hi,
On 19.03.2009 at 16:07, Malone, Scott wrote:
I am having two problems with the integration of OpenMPI 1.3 and SGE
6.2u1, both of which are new to us. The trouble is getting jobs
to suspend/resume and to collect CPU time correctly.
For suspend/resume I have added the following to my mpirun command:
--mca orte_forward_job_control 1 --mca plm_rsh_daemonize_qrsh 1
Why? In 1.3 the orted is already daemonizing because of a bug, and I
only found that daemonizing the orted is necessary for the notify
feature.
and adjusted the suspend_method for the queue that it’s running
in. I have not gotten it to place any process into the T state.
Although this is not a huge problem, I hope to have this working in
the future.
My main problem is getting the CPU time correct. On a multi-CPU
job, only the master node shows the correct CPU time for its
process; the others are very short, and I am not sure what they are
measuring (startup time, I believe). Here’s an example:
When the orted daemons daemonize, they are no longer bound to the
sge_shepherd. As a result, no one is tracking their
accounting on the nodes. AFAIK this will be fixed in 1.3.2, so that
the daemons stay bound to a running sge_shepherd.
If you need the -notify feature and correct accounting, you will need
to wait until the qrsh_starter in SGE is fixed so that it does not
exit when it receives a USR1/USR2 signal.
-- Reuti
cpu 0.360
cpu 0.480
cpu 0.470
cpu 0.490
cpu 0.530
cpu 0.470
cpu 0.680
cpu 464.305
From watching the runs, that time is close to the wall clock
time and matches what I see for that single process. I have since
gotten it to report what I believe are correct values, but I have to
include the --debug-daemons option in our mpirun command. With that I
get the following:
cpu 73.146
cpu 72.982
cpu 73.381
cpu 73.142
cpu 73.029
cpu 73.183
cpu 73.117
cpu 73.265
cpu 73.236
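
To compare the two runs at a glance, the per-task values can be totalled with a short script. This is only a minimal sketch: it assumes the listings are `cpu` lines copied from SGE accounting output, and the helper names `cpu_times` and `summarize` are made up for illustration. The sample data is the first listing from this message.

```python
import re

# Matches lines of the form "cpu <seconds>", as in the listings in this message.
CPU_LINE = re.compile(r"^\s*cpu\s+([0-9]+(?:\.[0-9]+)?)\s*$", re.MULTILINE)


def cpu_times(accounting_text):
    """Return every 'cpu <seconds>' value found in the text as a float."""
    return [float(value) for value in CPU_LINE.findall(accounting_text)]


def summarize(accounting_text):
    """Count the accounting records and total their CPU seconds."""
    times = cpu_times(accounting_text)
    return {
        "tasks": len(times),
        "total": sum(times),
        "max": max(times, default=0.0),
    }


# First listing from this message (run without --debug-daemons).
sample = """\
cpu 0.360
cpu 0.480
cpu 0.470
cpu 0.490
cpu 0.530
cpu 0.470
cpu 0.680
cpu 464.305
"""

if __name__ == "__main__":
    summary = summarize(sample)
    print(f"{summary['tasks']} records, "
          f"{summary['total']:.3f} s total, "
          f"{summary['max']:.3f} s largest")
```

A single record that dominates the total, as in the first listing, is the pattern to expect when only the master node's processes are being accounted for.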
I have noticed that when I get the correct CPU time, I see qrsh
processes that start up (my understanding is that this is what starts
the processes on the remote machines) and stay running until
the job is finished. When I don’t get the correct CPU time, I see
the qrsh processes start on the master node, but they die off once they
have started the processes on the remote nodes. The PE environment looks
like the following:
pe_name orte
slots 560
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
Please let me know if I can provide any more information to help
figure this out.
Thanks,
Scott Malone
Manager, High Performance Computing Facility
Information Sciences - Research Informatics
St. Jude Children’s Research Hospital
332 North Lauderdale
Memphis, TN 38105
901.495.4947
scott.mal...@stjude.org