I figured it would complain if I did that live so I did shut it down first. 
Good advice anyway. 

It wasn’t one particular job. One user submitted 140,000 jobs by mistake, and
the qmaster process kept running out of memory. I increased the memory on the
VM to 32GB and it managed to survive, but by the end of the day it had only
finished 1,000 jobs. That’s why I shut it down and blew away the jobs folder
in the qmaster spool directory. When I restarted it, it was clearly surprised,
but it recovered and settled down. This morning, however, the same user
submitted 40,000 jobs and it hiccupped. I restarted it twice in debug mode and
it worked its way through the error (something about a NULL value in
CE_string_val). Now it’s running in debug mode and seems to be settling down.

I have now limited users to 20,000 jobs and max jobs in the queue to 20,000.
The qmaster really needs a redesign. Holding -everything- in memory is simply
not a good idea.
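
For reference, limits like these are typically set in the global cluster
configuration. A minimal sketch, assuming the two limits above correspond to
the max_u_jobs (per-user) and max_jobs (cluster-wide) parameters described in
sge_conf(5); the exact settings used here were not posted:

```
# Edit the global configuration with: qconf -mconf
# (assumed mapping to the limits mentioned above)
max_u_jobs                   20000
max_jobs                     20000
```

The current values can be checked with qconf -sconf.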

Best regards,
Juan Jimenez
System Administrator, BIH HPC Cluster
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800

On 28.06.17, 10:30, "William Hay" <w....@ucl.ac.uk> wrote:

    On Tue, Jun 27, 2017 at 02:40:18PM +0000, juanesteban.jime...@mdc-berlin.de wrote:
    > I can't get qmaster to respond. Memory is no longer an issue but the
    > queue is 138,000+ jobs long and it's not responding to any control
    > commands. I need to manually delete the master job list.
    >
    > Am I correct in assuming that if I delete all the subdirectories in the
    > jobs folder in spool/qmaster, this will reset the master job list and
    > give me back control?
    The qmaster will probably get very confused if you modify the job spool
    live.  When I need to delete a job from the spool I normally shut down the
    qmaster first.
    If the qmaster isn't responding to regular commands then the kill command
    should do the job.
    
    You could also check the grid engine logs to see if the qmaster is
    complaining about a particular job and try deleting just that job from the
    spool.
    It might be a corrupt job record rather than the raw number of jobs.
    
    
    William
    

_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
