I figured it would complain if I did that live, so I shut it down first. Good advice anyway.
It wasn’t one particular job. One user submitted 140,000 jobs by mistake. The qmaster job would run out of memory. I then increased the memory on the VM to 32GB and it managed to survive, but at the end of the day had only finished 1,000 jobs. That’s why I shut it down and blew away the jobs folder in the qmaster spool directory. When I restarted it, it was clearly surprised, but recovered and settled down.

This morning, however, the same user submitted 40,000 jobs and it hiccupped. I restarted it twice in debug mode and it worked itself through the error (something about NULL value in CE_string_val). Now it’s running in debug mode and seems to be settling down. I have now limited users to 20,000 jobs and max jobs in the queue to 20,000.

The qmaster really needs a redesign. Holding -everything- in memory is simply not a good idea.

Mfg,
Juan Jimenez
System Administrator, BIH HPC Cluster
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800

On 28.06.17, 10:30, "William Hay" <w....@ucl.ac.uk> wrote:

    On Tue, Jun 27, 2017 at 02:40:18PM +0000, juanesteban.jime...@mdc-berlin.de wrote:
    > I can't get qmaster to respond. Memory is no longer an issue but the queue is 138,000+ jobs long and it's not responding to any control commands. I need to manually delete the master job list.
    >
    > Am I correct in assuming that if I delete all the subdirectories in the jobs folder in spool/qmaster, this will reset the master job list and give me back control?

    The qmaster will probably get very confused if you modify the job spool live. When I need to delete a job from the spool I normally shut down the qmaster first. If the qmaster isn't responding to regular commands then the kill command should do the job.

    You could also check the grid engine logs to see if the qmaster is complaining about a particular job and try deleting just that job from the spool. It might be a corrupt job record rather than the raw number of jobs.

    William

_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss