Hi Shooktija,
On Wed, Apr 17, 2024 at 7:45 AM Shooktija S N via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64
> ThreadsPerCore=2 State=UNKNOWN Gres=gpu:1
> PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE
Hi David,
> Those queries then should not have to happen too often, although do you
> have any indication of a range for when you say "you still wouldn't
> want to query the status too frequently." Because I don't really, and
> would probably opt for some compromise of every 30 seconds or so.
>
Eve
Hi David,
On Thu, Feb 23, 2023 at 10:50 AM David Laehnemann
wrote:
> But from your comment I understand that handling these queries in
> batches would be less work for slurmdbd, right? So instead of querying
> each jobid with a separate database query, it would do one database
> query for the whole batch of job IDs?
Hi David,
On Thu, Feb 23, 2023 at 8:51 AM David Laehnemann
wrote:
> Quick follow-up question: do you have any indication of the rate of job
> status checks via sacct that slurmdbd will gracefully handle (per
> second)? Or any suggestions how to roughly determine such a rate for a
> given cluster?
Hi David,
scontrol - interacts with slurmctld using RPC, so it is faster, but
requests put load on the scheduler itself.
sacct - interacts with slurmdbd, so it doesn't place additional load on the
scheduler.
There is a balance to strike, but the scontrol approach is riskier and can
start to interfere with the scheduler's own work.
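For example, a single batched sacct call covers many jobs in one slurmdbd
round trip (the job IDs below are just placeholders):

  sacct -X -n -P -j 1001,1002,1003 -o JobID,State,ExitCode

Here -X restricts output to the job allocations (no steps), -n drops the
header, and -P produces delimiter-separated output that is easy to parse in
a polling script.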
w job 9 | grep CPU_ID
> Nodes=larkin CPU_IDs=32-39 Mem=25600 GRES=
>
> apicid=32 is processor=16 and apicid=33 is processor=48 in /proc/cpuinfo
>
> Thanks
>
> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>
>
>
> On Tue, 13 Dec 2022 9:52am, Sean Maxwell wrote:
> >> $ srun --time=10:00:00 --cpus-per-task=8 --pty /bin/bash
> >> $ grep -i ^cpu /proc/self/status
> >> Cpus_allowed: 0780,0780
> >> Cpus_allowed_list: 7-10,39-42
> >>
> >>
> >> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
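For anyone repeating this check, a quick way to line up the two numberings is
standard procfs, nothing Slurm-specific (run it inside the job allocation):

  grep -E '^(processor|apicid)' /proc/cpuinfo | paste - -   # pair each logical CPU with its apicid
  grep Cpus_allowed_list /proc/self/status                  # CPUs actually bound to this step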
Hi Paul,
Nodename=foobar \
>CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 \
>RealMemory=256312 MemSpecLimit=32768 CpuSpecList=14-63 \
>TmpDisk=600 Gres=gpu:nvidia_rtx_a6000:1
>
> The slurm.conf also has:
>
> ProctrackType=proctrack/cgroup
> TaskPlugin=task/a
> It still allows me to:
>
> srun --mem=100 stoopid-memory-overallocation.x
>
> More memory is being allocated by the node than should be allowed.
>
> I'm clearly doing something wrong here. Can anyone point out what it is
> please? Am I just using the wrong test methodology?
Hi Sean,
I don't see PrologFlags=Contain in your slurm.conf. It is one of the
entries required to activate the cgroup containment:
https://slurm.schedmd.com/cgroup.conf.html#OPT_/etc/slurm/slurm.conf
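For reference, a rough sketch of the pieces that usually have to line up for
memory containment (values are illustrative, not a drop-in config):

  # slurm.conf
  ProctrackType=proctrack/cgroup
  TaskPlugin=task/affinity,task/cgroup
  PrologFlags=Contain
  SelectType=select/cons_tres
  SelectTypeParameters=CR_Core_Memory   # memory must be a consumable resource for --mem to be enforced

  # cgroup.conf
  ConstrainCores=yes
  ConstrainRAMSpace=yes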
Best,
-Sean
On Tue, Nov 8, 2022 at 8:16 AM Sean McGrath wrote:
> Hi,
>
> I can't get cgroups
> I am happy that
> it's working now.
>
> Cheers
> Dominik Baack
>
>
> On 27.10.2022 at 19:23, Sean Maxwell wrote:
>
> It looks like you are missing some of the slurm.conf entries related to
> enforcing the cgroup restrictions. I would go through the list here:
> ConstrainDevices=yes
> ConstrainRAMSpace=yes
> #
> #
>
> I attached the slurm configuration file as well
>
> Cheers
> Dominik
> On 27.10.2022 at 17:57, Sean Maxwell wrote:
>
> Hi Dominik,
>
> Do you have ConstrainDevices=yes set in your cgroup.conf?
>
> Best,
Hi Dominik,
Do you have ConstrainDevices=yes set in your cgroup.conf?
Best,
-Sean
On Thu, Oct 27, 2022 at 11:49 AM Dominik Baack <
dominik.ba...@cs.uni-dortmund.de> wrote:
> Hi,
>
> We are in the process of setting up SLURM on some DGX A100 nodes. We
> are experiencing the problem that all GPUs
> But what is the relation between gpu restriction and cgroup? I never heard
> that cgroup can limit gpu card usage. Isn’t it a feature of cuda or nvidia
> driver?
>
>
>
> *From:* Sean Maxwell
> *Sent:* 23 March 2022 23:05
> *To:* Slurm User Community List
> *Subject:* Re:
Hi,
If you are using cgroups for task/process management, you should verify
that your /etc/slurm/cgroup.conf has the following line:
ConstrainDevices=yes
I'm not sure about the missing environment variable, but the absence of the
above in cgroup.conf is one way the GPU devices can be left unconstrained.
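A quick sanity check from inside a job, assuming a GPU gres named "gpu" is
configured on the node:

  srun --gres=gpu:1 nvidia-smi -L                         # should list only the allocated GPU
  srun --gres=gpu:1 bash -c 'echo $CUDA_VISIBLE_DEVICES'  # set by Slurm's gres/gpu plugin for the step

If nvidia-smi still shows every GPU on the node, device constraint is not
taking effect.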
Hi Maria,
Have you tried adding the -u flag (specifies unbuffered) to your srun
command?
https://slurm.schedmd.com/srun.html#OPT_unbuffered
Your description sounds like buffering, so this might help.
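For example (the script name is just a placeholder):

  srun -u python3 my_script.py   # --unbuffered: srun passes output through as soon as it arrives

If the program itself buffers its stdout, its own unbuffered mode (e.g.
python3 -u) may also be needed.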
Thanks,
-Sean
On Tue, Feb 9, 2021 at 6:49 PM Maria Semple wrote:
> Hello all,
>
> I've noti
Hi Patrick,
I have seen a similar error while configuring native X-forwarding in Slurm.
It was caused by Slurm sending an IP to the compute node (as part of a
message) that was not routable back to the controller host. In my case it
was because the controller host was multihomed, and I had misconfigured the address it advertised.
Hi Mike,
I think you want to set MaxSubmitJobs on the user's account association. The
parameter is described in the sacctmgr documentation as being the maximum
number of jobs a user can have in state running or pending.
https://slurm.schedmd.com/sacctmgr.html
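For example, with placeholder user and account names:

  sacctmgr modify user where name=alice account=proj1 set MaxSubmitJobs=50
  sacctmgr show association user=alice format=User,Account,MaxSubmitJobs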
Thanks,
-Sean
On Wed, Mar 18, 2020
Hi Mahmood,
If you want the virtual memory size to be unrestricted by slurm, set
VSizeFactor to 0 in slurm.conf, which according to the documentation
disables virtual memory limit enforcement.
https://slurm.schedmd.com/slurm.conf.html#OPT_VSizeFactor
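That is, something like the following line in slurm.conf, followed by a
reconfigure/restart of the daemons:

  VSizeFactor=0   # 0 disables enforcement of the virtual memory (VSZ) limit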
-Sean
On Mon, Jan 27, 2020 at 11:47 PM Mahmood wrote: