Since I'm new to Open MPI, I want to make sure that I understand this. When the job starts, the orted is daemonized, and because of this the orted processes are not bound to the sge_shepherd on each node. This results in the loss of accounting for those processes. I guess that when I start mpirun with debugging, the orted is no longer daemonized and stays attached to the sge_shepherd? If this is true, is there any way to start the orted non-daemonized, without turning on debugging, until 1.3.2 is available?

Thanks!

Scott Malone
Manager, High Performance Computing Facility
Information Sciences - Research Informatics
St. Jude Children's Research Hospital
332 North Lauderdale
Memphis, TN 38105
901.495.4947
scott.mal...@stjude.org

> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Reuti
> Sent: Thursday, March 19, 2009 10:32 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI 1.3 and SGE 6.2u1
>
> Hi,
>
> On 19.03.2009 at 16:07, Malone, Scott wrote:
>
> > I am having two problems with the integration of OpenMPI 1.3 and SGE
> > 6.2u1, both of which are new to us. The troubles are getting jobs
> > to suspend/resume and collecting cpu time correctly.
> >
> > For suspend/resume I have added the following to my mpirun command:
> >
> > --mca orte_forward_job_control 1 --mca plm_rsh_daemonize_qrsh 1
>
> Why? In 1.3 the orted is already daemonizing because of a bug, and I
> only found that it's necessary for the notify feature to daemonize
> the orted.
>
> > and adjusted the suspend_method for the queue that it's running
> > in. I have not gotten it to place any process into the T state.
> > Although this is not a huge problem, I hope to have this working in
> > the future.
> >
> > My main problem is getting the cpu time correct. On a multiple cpu
> > job only the master node shows the correct cpu time for that
> > process; the others are very short, and I am not sure what they are
> > measuring (I believe startup time). Here's an example:
>
> When the orted processes daemonize, they are no longer bound to the
> sge_shepherd. As a result of this, there is no one tracking their
> accounting on the nodes. This will be fixed AFAIK in 1.3.2, so that
> the daemons are still bound to a running sge_shepherd.
>
> If you need the -notify feature and correct accounting, you will need
> to wait until the qrsh_starter in SGE is fixed not to exit when it
> receives a usr1/2.
>
> -- Reuti
>
> > cpu 0.360
> > cpu 0.480
> > cpu 0.470
> > cpu 0.490
> > cpu 0.530
> > cpu 0.470
> > cpu 0.680
> > cpu 464.305
> >
> > From watching the runs, that last time is close to the wall clock
> > time and matches what I see for that single process. Now I have
> > gotten it to give what I believe are correct values, but I have to
> > include the --debug-daemons option in our mpirun command. With that I
> > get the following:
> >
> > cpu 73.146
> > cpu 72.982
> > cpu 73.381
> > cpu 73.142
> > cpu 73.029
> > cpu 73.183
> > cpu 73.117
> > cpu 73.265
> > cpu 73.236
> >
> > I have noticed that when I get the cpu time correctly, I get qrsh
> > processes that start up (my understanding is that this is what starts
> > the processes on the remote machines) and they stay running until
> > the job is finished. When I don't get the correct cpu time, I see
> > the qrsh processes start on the master node, but they die off once
> > they start the process on the remote nodes. The PE environment looks
> > like the following:
> >
> > pe_name            orte
> > slots              560
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /bin/true
> > stop_proc_args     /bin/true
> > allocation_rule    $round_robin
> > control_slaves     TRUE
> > job_is_first_task  FALSE
> > urgency_slots      min
> > accounting_summary FALSE
> >
> > Please let me know if I can provide any more information to help
> > figure this out.
> >
> > Thanks,
> >
> > Scott Malone
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
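
For anyone who wants to check their own accounting output for the same symptom, here is a minimal Python sketch. It only parses lines of the form "cpu <seconds>", as quoted in this thread, and flags entries that are far smaller than the largest one, which is the pattern seen when the orted is daemonized. The helper names and the 0.1 ratio are assumptions made for illustration; they are not part of SGE or Open MPI.

```python
import re

# Matches accounting lines of the form "cpu 73.146", as quoted in the thread.
CPU_LINE = re.compile(r"^\s*cpu\s+([0-9]+(?:\.[0-9]+)?)\s*$")


def parse_cpu_seconds(text):
    """Return the cpu values (in seconds) found in accounting output."""
    values = []
    for line in text.splitlines():
        match = CPU_LINE.match(line)
        if match:
            values.append(float(match.group(1)))
    return values


def suspicious_entries(values, ratio=0.1):
    """Return the values that fall below ratio * the largest value.

    The default ratio of 0.1 is an arbitrary illustrative threshold,
    not an SGE or Open MPI setting.
    """
    if not values:
        return []
    largest = max(values)
    return [v for v in values if v < ratio * largest]


# Figures copied from the thread: run without --debug-daemons.
DAEMONIZED_RUN = """\
cpu 0.360
cpu 0.480
cpu 0.470
cpu 0.490
cpu 0.530
cpu 0.470
cpu 0.680
cpu 464.305
"""

# Figures copied from the thread: run with --debug-daemons.
DEBUG_RUN = """\
cpu 73.146
cpu 72.982
cpu 73.381
cpu 73.142
cpu 73.029
cpu 73.183
cpu 73.117
cpu 73.265
cpu 73.236
"""

if __name__ == "__main__":
    for label, text in (("daemonized", DAEMONIZED_RUN), ("debug", DEBUG_RUN)):
        values = parse_cpu_seconds(text)
        flagged = suspicious_entries(values)
        print(label, "entries:", len(values), "flagged:", len(flagged))
```

With the first set of figures, every entry except the master's is flagged; with the --debug-daemons figures, none are. A real check would read the values from the site's accounting output rather than from embedded strings.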