Correct again. I have opened a ticket to move the qmaster from a VM to its own full blade, and I have turned off schedd_job_info. Thanks again. :)
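(For readers of the archive: a minimal sketch of how that setting is usually changed, assuming a stock SoGE installation with qconf on the PATH; the parameter lives in the scheduler configuration.)

    # show the current value
    qconf -ssconf | grep schedd_job_info

    # open the scheduler configuration in $EDITOR and change the line to:
    #   schedd_job_info   false
    qconf -msconf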
I tried education. Doesn't always work.

Mfg,
Juan Jimenez
System Administrator, BIH HPC Cluster
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800

On 28.06.17, 12:12, "William Hay" <w....@ucl.ac.uk> wrote:

On Wed, Jun 28, 2017 at 08:35:52AM +0000, juanesteban.jime...@mdc-berlin.de wrote:
> I figured it would complain if I did that live, so I shut it down first. Good advice anyway.
>
> It wasn't one particular job. One user submitted 140,000 jobs by mistake. The qmaster would run out of memory. I then increased the memory on the VM to 32GB and it managed to survive, but by the end of the day it had only finished 1,000 jobs. That's why I shut it down and blew away the jobs folder in the qmaster spool directory. When I restarted it, it was clearly surprised, but it recovered and settled down. This morning, however, the same user submitted 40,000 jobs and it hiccupped. I restarted it twice in debug mode and it worked itself through the error (something about a NULL value in CE_string_val). Now it's running in debug mode and seems to be settling down.
>
> I have now limited users to 20,000 jobs each and capped the total jobs in the queue at 20,000. The qmaster really needs a redesign. Holding "everything" in memory is simply not a good idea.

If some details of each job aren't held in memory, then potentially every job has to be read in from disk every scheduling cycle in order to determine whether it can be scheduled. I think the underlying assumption is that the load on the qmaster will be roughly proportional to the size of the cluster, and if you have a large cluster you can afford to spend money on memory for the node hosting the qmaster.

That said, 40,000 jobs shouldn't be exhausting the memory on a 32GB VM; that's close to 1MB per job. If you want to keep the limits as you have them, you might want to look into tuning the system. In particular, I'd look at setting schedd_job_info to false in the scheduler configuration. Prior to SoGE 8.1.7, setting this to true could cause a memory leak; depending on your config, having it set to true could still lead to a lot of memory usage even if not an actual leak per se.

Likewise, encouraging users (by means of appropriate limits and education) to submit array jobs rather than lots of individual jobs should help keep the qmaster from being overwhelmed.

> Mfg,
> Juan Jimenez
> System Administrator, BIH HPC Cluster
> MDC Berlin / IT-Dept.
> Tel.: +49 30 9406 2800
>
> On 28.06.17, 10:30, "William Hay" <w....@ucl.ac.uk> wrote:
>
> On Tue, Jun 27, 2017 at 02:40:18PM +0000, juanesteban.jime...@mdc-berlin.de wrote:
> > I can't get qmaster to respond. Memory is no longer an issue, but the queue is 138,000+ jobs long and it's not responding to any control commands. I need to manually delete the master job list.
> >
> > Am I correct in assuming that if I delete all the subdirectories in the jobs folder in spool/qmaster, this will reset the master job list and give me back control?
>
> The qmaster will probably get very confused if you modify the job spool live. When I need to delete a job from the spool I normally shut down the qmaster first. If the qmaster isn't responding to regular commands then the kill command should do the job.
>
> You could also check the grid engine logs to see if the qmaster is complaining about a particular job and try deleting just that job from the spool. It might be a corrupt job record rather than the raw number of jobs.
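(Again for the archive, a rough sketch of the two suggestions above, assuming stock qconf/qsub tools; the limit values are simply the ones mentioned in this thread, and myjob.sh is a placeholder script name.)

    # cap per-user and system-wide job counts in the global configuration:
    # open it with qconf -mconf and set, e.g.
    #   max_u_jobs   20000
    #   max_jobs     20000
    qconf -mconf

    # submit one array job of 140000 tasks instead of 140000 separate jobs;
    # each task can pick up its index from $SGE_TASK_ID
    qsub -t 1-140000 myjob.sh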
> William

_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss