Correct again. I have opened a ticket to move the qmaster from a VM to its own full blade, and I have turned off schedd_job_info. Thanks again. :)
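(For readers of the archive: a minimal sketch of how that setting is usually changed, assuming a stock SoGE installation with qconf on the PATH; the parameter lives in the scheduler configuration.)

    # show the current value
    qconf -ssconf | grep schedd_job_info

    # open the scheduler configuration in $EDITOR and change the line to:
    #   schedd_job_info   false
    qconf -msconf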
I tried education. Doesn't always work.

Mfg,
Juan Jimenez
System Administrator, BIH HPC Cluster
MDC Berlin / IT-Dept.
Tel.: +49 30 9406 2800

On 28.06.17, 12:12, "William Hay" <w....@ucl.ac.uk> wrote:

On Wed, Jun 28, 2017 at 08:35:52AM +0000, juanesteban.jime...@mdc-berlin.de wrote:
> I figured it would complain if I did that live, so I shut it down first. Good advice anyway.
>
> It wasn't one particular job. One user submitted 140,000 jobs by mistake. The qmaster would run out of memory. I then increased the memory on the VM to 32GB and it managed to survive, but by the end of the day it had only finished 1,000 jobs. That's why I shut it down and blew away the jobs folder in the qmaster spool directory. When I restarted it, it was clearly surprised, but it recovered and settled down. This morning, however, the same user submitted 40,000 jobs and it hiccupped. I restarted it twice in debug mode and it worked itself through the error (something about a NULL value in CE_string_val). Now it's running in debug mode and seems to be settling down.
>
> I have now limited users to 20,000 jobs each and capped the total jobs in the queue at 20,000. The qmaster really needs a redesign. Holding "everything" in memory is simply not a good idea.

If some details of each job aren't held in memory, then potentially every job has to be read in from disk every scheduling cycle in order to determine whether it can be scheduled. I think the underlying assumption is that the load on the qmaster will be roughly proportional to the size of the cluster, and if you have a large cluster you can afford to spend money on memory for the node hosting the qmaster.

That said, 40,000 jobs shouldn't be exhausting the memory on a 32GB VM; that's close to 1MB per job. If you want to keep the limits as you have them, you might want to look into tuning the system. In particular, I'd look at setting schedd_job_info to false in the scheduler configuration. Prior to SoGE 8.1.7, setting this to true could cause a memory leak; depending on your config, having it set to true could still lead to a lot of memory usage even if not an actual leak per se.

Likewise, encouraging users (by means of appropriate limits and education) to submit array jobs rather than lots of individual jobs should help keep the qmaster from being overwhelmed.

> Mfg,
> Juan Jimenez
> System Administrator, BIH HPC Cluster
> MDC Berlin / IT-Dept.
> Tel.: +49 30 9406 2800
>
> On 28.06.17, 10:30, "William Hay" <w....@ucl.ac.uk> wrote:
>
> On Tue, Jun 27, 2017 at 02:40:18PM +0000, juanesteban.jime...@mdc-berlin.de wrote:
> > I can't get qmaster to respond. Memory is no longer an issue, but the queue is 138,000+ jobs long and it's not responding to any control commands. I need to manually delete the master job list.
> >
> > Am I correct in assuming that if I delete all the subdirectories in the jobs folder in spool/qmaster, this will reset the master job list and give me back control?
>
> The qmaster will probably get very confused if you modify the job spool live. When I need to delete a job from the spool I normally shut down the qmaster first. If the qmaster isn't responding to regular commands then the kill command should do the job.
>
> You could also check the grid engine logs to see if the qmaster is complaining about a particular job and try deleting just that job from the spool. It might be a corrupt job record rather than the raw number of jobs.
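(Again for the archive, a rough sketch of the two suggestions above, assuming stock qconf/qsub tools; the limit values are simply the ones mentioned in this thread, and myjob.sh is a placeholder script name.)

    # cap per-user and system-wide job counts in the global configuration:
    # open it with qconf -mconf and set, e.g.
    #   max_u_jobs   20000
    #   max_jobs     20000
    qconf -mconf

    # submit one array job of 140000 tasks instead of 140000 separate jobs;
    # each task can pick up its index from $SGE_TASK_ID
    qsub -t 1-140000 myjob.sh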
> William

_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss