Hi Moshe,

Moshe Mergy <moshe.me...@weizmann.ac.il> writes:
> Hi Loris
>
> indeed https://slurm.schedmd.com/resource_limits.html explains the
> possibilities of limitations
>
> At present, I do not limit memory for specific users, but just set a
> global limit in slurm.conf:
>
> MaxMemPerNode=65536 (for a 64 GB limit)
>
> But... anyway, for my Slurm version 20.02, any user can obtain MORE
> than 64 GB of memory by using the "--mem=0" option!
>
> So I had to filter this in job_submit.lua

We don't use MaxMemPerNode, but instead define RealMemory for groups of
nodes which have the same amount of RAM.  We share the nodes and use

  SelectType=select/cons_res
  SelectTypeParameters=CR_Core_Memory

so a job can't start on a node if it requests more memory than is
available, i.e. more than RealMemory minus the memory already committed
to other jobs, even if --mem=0 is specified (I guess).
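Just as a sketch of what I mean (the node names, CPU counts and
RealMemory values below are made up; they would of course have to match
what the nodes really have, e.g. as reported by 'slurmd -C'), the
relevant part of slurm.conf looks roughly like this:

  # Allocate individual cores and memory rather than whole nodes
  SelectType=select/cons_res
  SelectTypeParameters=CR_Core_Memory

  # Usable RAM (in MB) for each group of nodes, with some headroom
  # left for the OS (illustrative values only)
  NodeName=node[001-032] CPUs=32 RealMemory=191000
  NodeName=node[033-040] CPUs=32 RealMemory=385000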
Cheers,

Loris

> -------------------------------------------------------------------
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Loris
> Bennett <loris.benn...@fu-berlin.de>
> Sent: Thursday, December 8, 2022 10:57:56 AM
> To: Slurm User Community List
> Subject: Re: [slurm-users] srun --mem issue
>
> Loris Bennett <loris.benn...@fu-berlin.de> writes:
>
>> Moshe Mergy <moshe.me...@weizmann.ac.il> writes:
>>
>>> Hi Sandor
>>>
>>> I personally block "--mem=0" requests in file job_submit.lua (slurm 20.02):
>>>
>>>     if (job_desc.min_mem_per_node == 0 or job_desc.min_mem_per_cpu == 0) then
>>>         slurm.log_info("%s: ERROR: unlimited memory requested", log_prefix)
>>>         slurm.log_info("%s: ERROR: job %s from user %s rejected because of an invalid (unlimited) memory request.", log_prefix, job_desc.name, job_desc.user_name)
>>>         slurm.log_user("Job rejected because of an invalid memory request.")
>>>         return slurm.ERROR
>>>     end
>>
>> What happens if somebody explicitly requests all the memory, so in
>> Sandor's case --mem=500G?
>>
>>> Maybe there is a better or nicer solution...
>
> Can't you just use account and QOS limits:
>
> https://slurm.schedmd.com/resource_limits.html
>
> ?
>
> And anyway, what is the use case for preventing someone from using all
> the memory?  In our case, if someone really needs all the memory, they
> should be able to have it.
>
> However, I do have a chronic problem with users requesting too much
> memory.  My approach has been to try to get people to use 'seff' to see
> what resources their jobs in fact need.  In addition, each month we
> generate a graphical summary of 'seff' data for each user, like the one
> shown here
>
> https://www.fu-berlin.de/en/sites/high-performance-computing/Dokumentation/Statistik
>
> and automatically send an email to those with a large percentage of
> resource-inefficient jobs, telling them to look at their graphs and
> correct their resource requests for future jobs.
>
> Cheers,
>
> Loris
>
>>> All the best
>>> Moshe
>>>
>>> -------------------------------------------------------------------
>>> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
>>> Felho, Sandor <sandor.fe...@transunion.com>
>>> Sent: Wednesday, December 7, 2022 7:03 PM
>>> To: slurm-users@lists.schedmd.com
>>> Subject: [slurm-users] srun --mem issue
>>>
>>> TransUnion is running a ten-node site using Slurm with multiple queues.
>>> We have an issue with the --mem parameter.  There is one user who has
>>> read the Slurm manual and found --mem=0.  This gives them the maximum
>>> memory on the node (500 GiB) for a single job.  How can I block a
>>> --mem=0 request?
>>>
>>> We are running:
>>>
>>> * OS: RHEL 7
>>> * cgroups version 1
>>> * Slurm: 19.05
>>>
>>> Thank you,
>>>
>>> Sandor Felho
>>>
>>> Sr Consultant, Data Science & Analytics

--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin