Hi,

On 19.03.2009 at 16:07, Malone, Scott wrote:

I am having two problems with the integration of Open MPI 1.3 and SGE 6.2u1, both of which are new to us. The troubles are getting jobs to suspend/resume and collecting CPU time correctly.



For suspend/resume I have added the following to my mpirun command:



--mca orte_forward_job_control 1 --mca plm_rsh_daemonize_qrsh 1

Why? In 1.3 the orted is already daemonizing because of a bug, and I found that daemonizing the orted is only necessary for the -notify feature.

and adjusted the suspend_method for the queue that it’s running in. I have not gotten it to place any process into the T state. Although this is not a huge problem, I hope to have this working in the future.
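
For reference, a sketch of how the two pieces might fit together (the queue name all.q, the job script and the binary name are only placeholders, not taken from this setup):

  # in the queue configuration (qconf -mq all.q), use a catchable signal:
  suspend_method        SIGTSTP

  # in the job script:
  mpirun --mca orte_forward_job_control 1 -np $NSLOTS ./a.out

With that in place, suspending the job with 'qmod -sj <jobid>' should end up as a SIGTSTP on mpirun, which forwards it to the MPI processes, and they should then show up in state T.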



My main problem is getting the CPU time correct. On a multi-CPU job only the master node shows the correct CPU time for its process; the others are very short, and I am not sure what they are measuring (I believe startup time). Here's an example:

When the orteds daemonize, they are no longer bound to the sge_shepherd. As a result, no one is tracking their accounting on the nodes. AFAIK this will be fixed in 1.3.2, so that the daemons stay bound to a running sge_shepherd.
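
A quick way to check this on a compute node while the job is running (plain ps, nothing Open-MPI-specific):

  ps -eo pid,ppid,command | egrep 'sge_shepherd|orted'

If the orted is reported with PPID 1 instead of sitting under an sge_shepherd, it has detached and its children are invisible to SGE's accounting.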

If you need the -notify feature and correct accounting, you will need to wait until the qrsh_starter in SGE is fixed so that it does not exit when it receives a USR1/USR2.

-- Reuti



cpu          0.360

cpu          0.480

cpu          0.470

cpu          0.490

cpu          0.530

cpu          0.470

cpu          0.680

cpu          464.305
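
(As a side note, the same per-node cpu figures can be cross-checked after the job has finished with SGE's accounting, assuming accounting_summary is FALSE as in the PE shown further down:

  qacct -j <jobid>

which prints one record per qrsh task, each with its own cpu line.)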



And from watching the runs, that time is close to the wall-clock time and matches what I see for that single process. I have gotten it to give what I believe are correct values, but I have to include the --debug-daemons option in our mpirun command. With that I get the following:



cpu          73.146

cpu          72.982

cpu          73.381

cpu          73.142

cpu          73.029

cpu          73.183

cpu          73.117

cpu          73.265

cpu          73.236
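
(Presumably the difference comes from --debug-daemons keeping the orteds in the foreground instead of letting them daemonize, so they stay under their sge_shepherd. The invocation for that case would look something like:

  mpirun --debug-daemons --mca orte_forward_job_control 1 -np $NSLOTS ./a.out

at the cost of rather verbose daemon output in the job's stdout/stderr.)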



I have noticed that when I get the CPU time correctly, qrsh processes start up (my understanding is that these are what start the processes on the remote machines) and they stay running until the job is finished. When I don't get the correct CPU time, I see the qrsh processes start on the master node but die off once they start the process on the remote nodes. The PE environment looks like the following:





pe_name            orte

slots              560

user_lists         NONE

xuser_lists        NONE

start_proc_args    /bin/true

stop_proc_args     /bin/true

allocation_rule    $round_robin

control_slaves     TRUE

job_is_first_task  FALSE

urgency_slots      min

accounting_summary FALSE
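
One more thing that can be watched on the master node while the job is running (just an illustration; the grep pattern assumes the usual tight-integration launch):

  ps -eo pid,ppid,command | grep 'qrsh -inherit'

With control_slaves TRUE, Open MPI starts each remote orted via 'qrsh -inherit <host> orted ...'; those qrsh processes normally stay alive for the whole job. If they die right after startup, the remote daemons have detached from SGE and their CPU time is lost, which matches the two accounting cases described above.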



Please let me know if I can provide any more information to help figure this out.



Thanks,



Scott Malone

Manager, High Performance Computing Facility

Information Sciences - Research Informatics

St. Jude Children’s Research Hospital

332 North Lauderdale

Memphis, TN 38105

901.495.4947

scott.mal...@stjude.org





