Since I'm new to Open MPI, I want to make sure that I understand this.  When 
the job starts, the orted is daemonized, and because of this it is no longer 
bound to the sge_shepherd on each node.  This results in the loss of accounting 
for those processes.  I guess that when I start mpirun with debugging, the 
orted is no longer daemonized and stays attached to the sge_shepherd?  If this 
is true, is there any way to start the orted without daemonizing it, and 
without turning on debugging, until 1.3.2 is available?
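
For reference, the two invocations being compared can be sketched as below.
This is only an illustration: --debug-daemons is the option discussed in this
thread, while the helper name, the program path ./my_mpi_app, and the slot
count of 8 are placeholders and not part of the actual setup.

```python
def build_mpirun_command(program, slots, debug_daemons=False):
    """Return an mpirun argument list (hypothetical helper for illustration)."""
    cmd = ["mpirun", "-np", str(slots)]
    if debug_daemons:
        # Workaround observed in this thread: with this option the cpu
        # times reported for the remote processes looked correct.
        cmd.append("--debug-daemons")
    cmd.append(program)
    return cmd


if __name__ == "__main__":
    # Placeholder program name and slot count.
    print(" ".join(build_mpirun_command("./my_mpi_app", 8)))
    print(" ".join(build_mpirun_command("./my_mpi_app", 8, debug_daemons=True)))
```

The only difference between the two command lines is the --debug-daemons
option, which is what I would like to avoid having to add.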


Thanks!

Scott Malone
Manager, High Performance Computing Facility
Information Sciences - Research Informatics
St. Jude Children's Research Hospital
332 North Lauderdale
Memphis, TN 38105
901.495.4947
scott.mal...@stjude.org
 


> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Reuti
> Sent: Thursday, March 19, 2009 10:32 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI 1.3 and SGE 6.2u1
> 
> Hi,
> 
> Am 19.03.2009 um 16:07 schrieb Malone, Scott:
> 
> > I am having two problems with the integration of OpenMPI 1.3 and SGE
> > 6.2u1, both of which are new to us.  The troubles are getting jobs
> > to suspend/resume and collecting cpu time correctly.
> >
> >
> >
> > For suspend/resume I have added the following to my mpirun command:
> >
> >
> >
> > --mca orte_forward_job_control 1 --mca plm_rsh_daemonize_qrsh 1
> >
> Why? In 1.3 the orted is already daemonizing because of a bug, and I
> found that daemonizing the orted is only necessary for the notify
> feature.
> 
> >  and adjusted the suspend_method for the queue that it's running
> > in.  I have not gotten it to place any process into the T state.
> > Although this is not a huge problem, I hope to have this working in
> > the future.
> >
> >
> >
> > My main problem is getting the cpu time correct.  On a multiple-cpu
> > job only the master node shows the correct cpu time for its
> > process; the others are very short, and I am not sure what they are
> > measuring (I believe startup time).  Here's an example:
> >
> When the orted daemons daemonize, they are no longer bound to the
> sge_shepherd. As a result, no one is tracking their
> accounting on the nodes. AFAIK this will be fixed in 1.3.2, so that
> the daemons stay bound to a running sge_shepherd.
> 
> If you need the -notify feature and correct accounting, you will need
> to wait until the qrsh_starter in SGE is fixed so that it does not exit
> when it receives a usr1/2 signal.
> 
> -- Reuti
> 
> >
> >
> > cpu          0.360
> > cpu          0.480
> > cpu          0.470
> > cpu          0.490
> > cpu          0.530
> > cpu          0.470
> > cpu          0.680
> > cpu          464.305
> >
> >
> >
> > And from watching the runs, that time is close to the wall clock
> > time and matches what I see for that single process.  Now I have
> > gotten it to give what I believe are correct values, but I have to
> > include the --debug-daemons option in our mpirun command.  With that I
> > get the following:
> >
> >
> >
> > cpu          73.146
> > cpu          72.982
> > cpu          73.381
> > cpu          73.142
> > cpu          73.029
> > cpu          73.183
> > cpu          73.117
> > cpu          73.265
> > cpu          73.236
> >
> >
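
To compare the two runs, the cpu values quoted above can be summed with a
small script. This is a sketch that assumes the lines come from
accounting output in the "cpu <seconds>" form shown here (for example from
something like qacct -j <job_id>, which is an assumption on my part); the
function name is made up for illustration.

```python
import re


def cpu_seconds(accounting_text):
    """Collect the values from lines of the form 'cpu <seconds>'."""
    values = []
    for line in accounting_text.splitlines():
        match = re.match(r"\s*cpu\s+([0-9]+(?:\.[0-9]+)?)\s*$", line)
        if match:
            values.append(float(match.group(1)))
    return values


# Values copied from the two runs quoted in this thread.
WITHOUT_DEBUG = """
cpu          0.360
cpu          0.480
cpu          0.470
cpu          0.490
cpu          0.530
cpu          0.470
cpu          0.680
cpu          464.305
"""

WITH_DEBUG = """
cpu          73.146
cpu          72.982
cpu          73.381
cpu          73.142
cpu          73.029
cpu          73.183
cpu          73.117
cpu          73.265
cpu          73.236
"""

if __name__ == "__main__":
    for label, text in (("without --debug-daemons", WITHOUT_DEBUG),
                        ("with --debug-daemons", WITH_DEBUG)):
        values = cpu_seconds(text)
        print(f"{label}: {len(values)} records, total {sum(values):.3f} s")
```

Summing per job this way makes it easy to see whether the remote processes
are being accounted for or only the one on the master node.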
> >
> > I have noticed that when I get the cpu time correctly, I get qrsh
> > processes that start up (my understanding is that these are what start
> > the processes on the remote machines) and stay running until
> > the job is finished.  When I don't get the correct cpu time, I see
> > the qrsh processes start on the master node, but they die off once they
> > start the processes on the remote nodes.  The PE environment looks
> > like the following:
> >
> >
> >
> >
> >
> > pe_name            orte
> > slots              560
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /bin/true
> > stop_proc_args     /bin/true
> > allocation_rule    $round_robin
> > control_slaves     TRUE
> > job_is_first_task  FALSE
> > urgency_slots      min
> > accounting_summary FALSE
> >
> >
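
A quick way to sanity-check a PE definition like the one above is to parse
the "key value" lines and look at the settings that matter for tight
integration. This is a rough sketch: it assumes the text is in the form
printed by qconf -sp <pe_name>, and the two checks (control_slaves TRUE and
job_is_first_task FALSE) reflect commonly recommended settings for jobs
launched through mpirun, which is an assumption rather than something stated
in this thread. The function names are made up for illustration.

```python
def parse_pe_config(text):
    """Parse 'key value' lines into a dict, skipping blank lines."""
    config = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            config[parts[0]] = parts[1].strip()
    return config


def tight_integration_issues(config):
    """Return a list of settings that differ from the assumed conventions."""
    issues = []
    if config.get("control_slaves", "").upper() != "TRUE":
        issues.append("control_slaves is not TRUE")
    if config.get("job_is_first_task", "").upper() != "FALSE":
        issues.append("job_is_first_task is not FALSE")
    return issues


# PE definition copied from this thread.
PE_TEXT = """
pe_name            orte
slots              560
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
"""

if __name__ == "__main__":
    problems = tight_integration_issues(parse_pe_config(PE_TEXT))
    print(problems if problems else "no issues found by these two checks")
```

By these two checks the PE shown here looks fine, which fits the explanation
that the missing cpu time comes from the daemonizing orted rather than from
the PE setup.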
> >
> > Please let me know if I can provide any more information to help
> > figure this out.
> >
> >
> >
> > Thanks,
> >
> > Email Disclaimer: www.stjude.org/emaildisclaimer
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 


