[slurm-users] Re: Reserving resources for use by non-slurm stuff

2024-04-17 Thread Sean Maxwell via slurm-users
Hi Shooktija, On Wed, Apr 17, 2024 at 7:45 AM Shooktija S N via slurm-users <slurm-users@lists.schedmd.com> wrote: > NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 > ThreadsPerCore=2 State=UNKNOWN Gres=gpu:1 > PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE
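
The thread topic is keeping some node resources back for non-Slurm processes. A minimal sketch of how that is usually done in slurm.conf, using core and memory specialization; the CoreSpecCount/MemSpecLimit values below are illustrative and not taken from the thread:

    NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 \
        CoreSpecCount=4 MemSpecLimit=8192 Gres=gpu:1 State=UNKNOWN

The specialized cores and memory are excluded from what jobs can be allocated; actually enforcing the reservation against job processes assumes task/cgroup with ConstrainCores=yes and ConstrainRAMSpace=yes in cgroup.conf.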

Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-24 Thread Sean Maxwell
Hi David, Those queries then should not have to happen too often, although do you > have any indication of a range for when you say "you still wouldn't > want to query the status too frequently." Because I don't really, and > would probably opt for some compromise of every 30 seconds or so. > Eve

Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-23 Thread Sean Maxwell
Hi David, On Thu, Feb 23, 2023 at 10:50 AM David Laehnemann wrote: > But from your comment I understand that handling these queries in > batches would be less work for slurmdbd, right? So instead of querying > each jobid with a separate database query, it would do one database > query for the wh

Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-23 Thread Sean Maxwell
Hi David, On Thu, Feb 23, 2023 at 8:51 AM David Laehnemann wrote: > Quick follow-up question: do you have any indication of the rate of job > status checks via sacct that slurmdbd will gracefully handle (per > second)? Or any suggestions how to roughly determine such a rate for a > given cluster

Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-23 Thread Sean Maxwell
Hi David, scontrol - interacts with slurmctld using RPC, so it is faster, but requests put load on the scheduler itself. sacct - interacts with slurmdbd, so it doesn't place additional load on the scheduler. There is a balance to reach, but the scontrol approach is riskier and can start to interf
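
The thread's suggestion is to batch job IDs into a single sacct call rather than querying each job separately, which keeps the load on slurmdbd low and off slurmctld entirely. A sketch of such a batched query, with hypothetical job IDs:

    sacct -j 1001,1002,1003 --format=JobID,State,Elapsed --noheader --parsable2

Polling a query like this at a modest interval (the thread discusses roughly every 30 seconds) is far cheaper than issuing one scontrol show job call per job.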

Re: [slurm-users] CPUSpecList confusion

2022-12-13 Thread Sean Maxwell
w job 9 | grep CPU_ID > Nodes=larkin CPU_IDs=32-39 Mem=25600 GRES= > > apicid=32 is processor=16 and apicid=33 is processor=48 in /proc/cpuinfo > > Thanks > > -- Paul Raines (http://help.nmr.mgh.harvard.edu) > > > > On Tue, 13 Dec 2022 9:52am, Sean Maxwell wrote:

Re: [slurm-users] CPUSpecList confusion

2022-12-13 Thread Sean Maxwell
--time=10:00:00 --cpus-per-task=8 --pty /bin/bash > >> $ grep -i ^cpu /proc/self/status > >> Cpus_allowed: 0780,0780 > >> Cpus_allowed_list: 7-10,39-42 > >> > >> > >> -- Paul Raines (http://help.nmr.mgh.harvard.edu) >

Re: [slurm-users] CPUSpecList confusion

2022-12-12 Thread Sean Maxwell
Hi Paul, Nodename=foobar \ >CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 \ >RealMemory=256312 MemSpecLimit=32768 CpuSpecList=14-63 \ >TmpDisk=600 Gres=gpu:nvidia_rtx_a6000:1 > > The slurm.conf also has: > > ProctrackType=proctrack/cgroup > TaskPlugin=task/a
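
A detail behind much of the confusion in this thread: CpuSpecList (and the CPU_IDs that scontrol reports for a job) use Slurm's abstract CPU numbering, which need not match the processor numbers in /proc/cpuinfo. To see which abstract CPUs a running job was given, the detail flag is needed; the job ID here is hypothetical:

    scontrol -d show job 12345 | grep CPU_IDs

Comparing that against Cpus_allowed_list inside the job (grep Cpus_allowed /proc/self/status) shows how the abstract IDs map onto machine CPU IDs on that node.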

Re: [slurm-users] Cgroups not constraining memory & cores

2022-11-11 Thread Sean Maxwell
t still allows me to: > > srun --mem=100 stoopid-memory-overallocation.x > > More memory is being allocated by the node than should be allowed. > > I'm clearly doing something wrong here. Can anyone point out what it is > please? Am I just using the wrong test methodology? >

Re: [slurm-users] Cgroups not constraining memory & cores

2022-11-08 Thread Sean Maxwell
Hi Sean, I don't see PrologFlags=Contain in your slurm.conf. It is one of the entries required to activate the cgroup containment: https://slurm.schedmd.com/cgroup.conf.html#OPT_/etc/slurm/slurm.conf Best, -Sean On Tue, Nov 8, 2022 at 8:16 AM Sean McGrath wrote: > Hi, > > I can't get cgroups
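
For reference, a minimal sketch of the pieces that have to line up for cgroup containment of cores and memory (values illustrative; the cgroup.conf documentation linked above lists the required slurm.conf entries):

    # slurm.conf
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup
    PrologFlags=Contain
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory

    # cgroup.conf
    ConstrainCores=yes
    ConstrainRAMSpace=yes

Memory is only enforced when it is a consumable resource (a CR_*_Memory setting) and the job actually has a memory limit, e.g. from --mem or a partition/cluster default.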

Re: [slurm-users] GPU Allocation does not limit number of available GPUs in job

2022-10-27 Thread Sean Maxwell
I am happy that > it's working now. > > Cheers > Dominik Baack > > > On 27.10.2022 at 19:23, Sean Maxwell wrote: > > It looks like you are missing some of the slurm.conf entries related to > enforcing the cgroup restrictions. I would go through the list her

Re: [slurm-users] GPU Allocation does not limit number of available GPUs in job

2022-10-27 Thread Sean Maxwell
ConstrainDevices=yes > ConstrainRAMSpace=yes > # > # > > I attached the slurm configuration file as well > > Cheers > Dominik > On 27.10.2022 at 17:57, Sean Maxwell wrote: > > Hi Dominik, > > Do you have ConstrainDevices=yes set in your cgroup.conf? > > Best,

Re: [slurm-users] GPU Allocation does not limit number of available GPUs in job

2022-10-27 Thread Sean Maxwell
Hi Dominik, Do you have ConstrainDevices=yes set in your cgroup.conf? Best, -Sean On Thu, Oct 27, 2022 at 11:49 AM Dominik Baack < dominik.ba...@cs.uni-dortmund.de> wrote: > Hi, > > We are in the process of setting up SLURM on some DGX A100 nodes . We > are experiencing the problem that all GP
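
For anyone hitting the same symptom, a sketch of the two files involved (node name and device paths are illustrative for an 8-GPU node, not taken from the thread):

    # cgroup.conf
    ConstrainDevices=yes

    # gres.conf
    NodeName=dgx01 Name=gpu File=/dev/nvidia[0-7]

With ConstrainDevices=yes, the cgroup device controller hides the non-allocated /dev/nvidia* devices from the job, rather than relying on CUDA_VISIBLE_DEVICES alone.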

Re: [slurm-users] how to locate the problem when slurm failed to restrict gpu usage of user jobs

2022-03-24 Thread Sean Maxwell
> But what is the relation between gpu restriction and cgroup? I never heard > that cgroup can limit gpu card usage. Isn’t it a feature of cuda or nvidia > driver? > > > > *From:* Sean Maxwell > *Sent:* 23 Mar 2022 23:05 > *To:* Slurm User Community List > *Subject:* Re:

Re: [slurm-users] how to locate the problem when slurm failed to restrict gpu usage of user jobs

2022-03-23 Thread Sean Maxwell
Hi, If you are using cgroups for task/process management, you should verify that your /etc/slurm/cgroup.conf has the following line: ConstrainDevices=yes I'm not sure about the missing environment variable, but the absence of the above in cgroup.conf is one way the GPU devices can be unconstrain
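
A quick way to check whether the constraint is working from inside a job allocation (the gres request and command are just an example):

    srun --gres=gpu:1 bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES; nvidia-smi -L'

When ConstrainDevices=yes is in effect, nvidia-smi should list only the allocated GPU; if every GPU on the node shows up, the cgroup device constraint is not being applied, regardless of what CUDA_VISIBLE_DEVICES says.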

Re: [slurm-users] Job Step Output Delay

2021-02-10 Thread Sean Maxwell
Hi Maria, Have you tried adding the -u flag (specifies unbuffered) to your srun command? https://slurm.schedmd.com/srun.html#OPT_unbuffered Your description sounds like buffering, so this might help. Thanks, -Sean On Tue, Feb 9, 2021 at 6:49 PM Maria Semple wrote: > Hello all, > > I've noti
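
For example (application command hypothetical):

    srun -u ./my_solver --input run1.dat

or equivalently srun --unbuffered. Note this only changes how srun handles the task's output; an application that buffers internally (e.g. Python without -u) can still appear to delay its output.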

Re: [slurm-users] failed to send msg type 6002: No route to host

2020-11-12 Thread Sean Maxwell
Hi Patrick, I have seen a similar error while configuring native X-forwarding in Slurm. It was caused by Slurm sending an IP to the compute node (as part of a message) that was not routable back to the controller host. In my case it was because the controller host was multihomed, and I had misconf
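
One slurm.conf knob relevant to multihomed controller hosts is the optional address accepted by SlurmctldHost; a sketch, with hostname and IP purely illustrative:

    SlurmctldHost=ctrl01(10.10.0.1)

The address in parentheses tells the daemons which address to use to reach slurmctld, independent of how the hostname resolves on each node.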

Re: [slurm-users] Limit Number of Jobs per User in Queue?

2020-03-18 Thread Sean Maxwell
Hi Mike, I think you want to set MaxSubmitJobs on the users account association. The parameter is described in the sacctmgr documentation as being the maximum number of jobs a user can have in state running or pending. https://slurm.schedmd.com/sacctmgr.html Thanks, -Sean On Wed, Mar 18, 2020
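
A sketch of how that looks with sacctmgr (user name and limit value are hypothetical):

    sacctmgr modify user where name=alice set MaxSubmitJobs=100

If limits are managed through QOSes rather than associations, the QOS-level equivalent is MaxSubmitJobsPerUser.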

Re: [slurm-users] Virtual memory size requested by slurm

2020-01-28 Thread Sean Maxwell
Hi Mahmood, If you want the virtual memory size to be unrestricted by slurm, set VSizeFactor to 0 in slurm.conf, which according to the documentation disables virtual memory limit enforcement. https://slurm.schedmd.com/slurm.conf.html#OPT_VSizeFactor -Sean On Mon, Jan 27, 2020 at 11:47 PM Mahmo
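
The corresponding slurm.conf line is just:

    VSizeFactor=0

VSizeFactor is a percentage of the job's real memory allocation used as its virtual memory limit, so 0 disables that enforcement, as the documentation linked above describes.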