Just double checking. Can you check on your worker node
1. ls -la /etc/pam.d/*slurm*
(just checking if there's a specific pam file for slurmd on your system)
2. scontrol show config | grep -i SlurmdUser
(checking if slurmd is set up with a different user to SlurmUser)
3. grep slurm /e
slurmctld runs as the user slurm, whereas slurmd runs as root.
Make sure the permissions on /app/slurm-24.0.8/lib/slurm allow the user slurm
to read the files
e.g. you could do (as root)
sudo -u slurm ls /app/slurm-24.0.8/lib/slurm
and see if the slurm user can read the directory (as well as t
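If it can't read them, something like this (run as root) is usually enough to
fix the permissions, assuming nothing else on that node needs tighter access
to the tree:
chmod -R o+rX /app/slurm-24.0.8/lib/slurm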
On the worker node, check if cgroups are mounted
grep cgroup /proc/mounts
(normally it's in /sys/fs/cgroup )
then check if Slurm is setting up the cgroup
find /sys/fs/cgroup | grep slurm
e.g.
[root@spartan-gpgpu164 ~]# find /sys/fs/cgroup/memory | grep slurm
/sys/fs/cgroup/memory/slurm
/sys/f
Hi Willy,
sacctmgr modify account slurmaccount user=baduser set maxjobs=0
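You can confirm the limit took effect with something like
sacctmgr show assoc where user=baduser format=Account,User,MaxJobs
(a sketch; adjust the format fields to taste). Setting maxjobs back to -1
clears the limit again.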
Sean
From: slurm-users on behalf of
Markuske, William
Sent: Friday, 26 May 2023 09:16
To: slurm-users@lists.schedmd.com
Subject: [EXT] [slurm-users] Temporary Stop User Submission
Hi Jeff,
The support system is here - https://bugs.schedmd.com/
Create an account, log in, and when creating a request, select your site from
the Site selection box.
Sean
From: slurm-users on behalf of Jeffrey
R. Lang
Sent: Friday, 25 March 2022 08:48
To: slu
Did you build Slurm yourself from source? If so, when you build from source, on
that node, you need to have the munge-devel package installed (munge-devel on
EL systems, libmunge-dev on Debian)
You then need to set up munge with a shared munge key between the nodes, and
have the munge daemon ru
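A rough sketch of that munge setup, assuming EL-style paths and systemd
(adjust for your distro; "worker" stands for your compute node's hostname):
scp /etc/munge/munge.key worker:/etc/munge/munge.key
ssh worker 'chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key'
ssh worker 'systemctl enable --now munge'
munge -n | ssh worker unmunge
The last command should decode cleanly on the worker if the keys match.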
Any error in slurmd.log on the node or slurmctld.log on the ctl?
Sean
From: slurm-users on behalf of Wayne
Hendricks
Sent: Saturday, 15 January 2022 16:04
To: slurm-us...@schedmd.com
Subject: [EXT] [slurm-users] Strange sbatch error with 21.08.2&5
n of Mariadb are you using?
Brian Andrus
On 12/3/2021 4:20 PM, Giuseppe G. A. Celano wrote:
After installation of libmariadb-dev, I have reinstalled the entire slurm with
./configure + options, make, and make install. Still,
accounting_storage_mysql.so is missing.
On Sat, Dec 4, 2021 at 12:24 A
Did you run
./configure (with any other options you normally use)
make
make install
on your DBD server after you installed the mariadb-devel package?
From: slurm-users on behalf of Giuseppe
G. A. Celano
Sent: Saturday, 4 December 2021 10:07
To: Slurm User Comm
Sent: Thursday, 21 October 2021 21:54
To: slurm-users@lists.schedmd.com ; Sean Crosby
Subject: Re: [EXT] Re: [slurm-users] Missing data in sreport for a time period
in slurm
Hi Sean,
After changing those values yesterda
; Sean Crosby
Subject: Re: [EXT] Re: [slurm-users] Missing data in sreport for a time period
in slurm
Dear All,
By checking the value of last ran table, hourly rollup shows today's
sreport keeps track of when it has done the last rollup calculations in the
database.
Open MySQL for your Slurm accounting database, do
select * from slurm_acct_db.clustername_last_ran_table;
where slurm_acct_db is your accounting database name (slurm_acct_db is
default), and clustername is
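The values in that table are unix timestamps; assuming the default column
names, something like
select from_unixtime(hourly_rollup), from_unixtime(daily_rollup), from_unixtime(monthly_rollup) from slurm_acct_db.clustername_last_ran_table;
makes them human-readable.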
s root ? Can this be an issue
Amjad
On Tue, Aug 31, 2021 at 8:22 AM Sean Crosby <scro...@unimelb.edu.au> wrote:
What does sacctmgr show for the user you added to have access to the QoS, and
what does Slurm show for the partition config?
sacctmgr show account withassoc -p
scontr
...@gmail.com>> wrote:
Hi Sean,
Thanks for the suggestion, seems to work now.
Majid
On Fri, Aug 27, 2021 at 12:56 PM Sean Crosby <scro...@unimelb.edu.au> wrote:
Hi Amjad,
Make sure you have qos in the config entry AccountingStorageEnforce
e.g.
AccountingStorageEnforce=associa
Hi Fritz,
job_submit_lua.so gets made upon compilation of Slurm if you have the lua-devel
package installed at the time of configure/make.
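A sketch of the rebuild, assuming an EL system (on Debian the package is the
liblua...-dev variant) and that you build in the extracted source tree:
yum install lua-devel
./configure <your usual options>
make && make install
Afterwards job_submit_lua.so should show up under the lib/slurm directory of
your install prefix.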
Sean
From: slurm-users on behalf of
Ratnasamy, Fritz
Sent: Tuesday, 31 August 2021 15:05
To: Slurm User Community List
S
Hi Amjad,
Make sure you have qos in the config entry AccountingStorageEnforce
e.g.
AccountingStorageEnforce=associations,limits,qos,safe
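You can confirm what the running slurmctld has picked up with
scontrol show config | grep AccountingStorageEnforce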
Sean
From: slurm-users on behalf of Amjad
Syed
Sent: Friday, 27 August 2021 20:28
To: slurm-us...@schedmd.com
Subject: [
Hi Felix,
From one of the recent Slurm user group meetings, the recommended way to
logrotate the Slurm logs is to send SIGUSR2.
My logrotate entry is
/var/log/slurm/slurmctld.log {
compress
missingok
nocopytruncate
nocreate
delaycompress
nomail
notifempty
noolddir
rotate 5
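The part that actually signals the daemon is a postrotate block; a sketch of
how the entry can end, assuming the default SlurmctldPidFile of
/var/run/slurmctld.pid (scontrol show config | grep -i pidfile will show yours):
postrotate
kill -USR2 $(cat /var/run/slurmctld.pid)
endscript
}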
Hi Mike,
To build pam_slurm_adopt, you need the pam-devel package installed on the node
you're building Slurm on.
On RHEL, it's pam-devel, and Debian it's libpam-dev
Once you have installed that, do ./configure again, and then you should be able
to make the pam_slurm_adopt
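A sketch of the full sequence on the build host, assuming you build from the
extracted source tree:
yum install pam-devel (or apt-get install libpam-dev on Debian)
./configure <your usual options>
make
cd contribs/pam_slurm_adopt
make && make install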
Sean
__
Hi Sid,
On our cluster, it performs just like your PBS cluster.
$ srun -N 1 --cpus-per-task 8 --time 01:00:00 --mem 2g --partition physicaltest
-q hpcadmin --pty python3
srun: job 27060036 queued and waiting for resources
srun: job 27060036 has been allocated resources
Python 3.6.8 (default, Aug
We use sacctmgr list stats for our Slurmdbd check
Our Nagios check is
RESULT=$(/usr/local/slurm/latest/bin/sacctmgr list stats)
if [ $? -ne 0 ]
then
echo "ERROR: cannot connect to database"
exit 2
fi
echo "$RESULT" | head -n 4
exit 0
Sean
From: sl
Hi Paul,
Try
sacctmgr modify qos gputest set flags=DenyOnLimit
Sean
From: slurm-users on behalf of Paul
Raines
Sent: Saturday, 29 May 2021 12:48
To: slurm-users@lists.schedmd.com
Subject: [EXT] [slurm-users] rejecting jobs that exceed QOS limits
ling=256
>AllocTRES=
>CapWatts=n/a
>CurrentWatts=0 AveWatts=0
>ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>Comment=(null)
>
>
>
>
> On Sun, Apr 11, 2021 at 2:03 AM Sean Crosby
> wrote:
>
>> Hi Cristobal,
>>
>> My hunch is
Hi Cristobal,
My hunch is it is due to the default memory/CPU settings.
Does it work if you do
srun --gres=gpu:A100:1 --cpus-per-task=1 --mem=10G nvidia-smi
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of
name resolution works. You have set the names in
Slurm to be wn001-wn044, so every node has to be able to resolve those
names. Hence the check using ping
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Vic
node
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Thu, 8 Apr 2021 at 16:38, Ioannis Botsis wrote:
I just checked my cluster and my spool dir is
SlurmdSpoolDir=/var/spool/slurm
(i.e. without the d at the end)
It doesn't really matter, as long as the directory exists and has the
correct permissions on all nodes
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Comp
rectory on all your
nodes. It needs to be owned by user slurm
ls -lad /var/spool/slurmd
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Tue, 6 Apr 2021 at 20:37, Sean Cro
ow cluster
If that doesn't work, try changing AccountingStorageHost in slurm.conf to
localhost as well
For your worker nodes, your nodes are all in drain state.
Show the output of
scontrol show node wn001
It will give you the reason for why the node is drained.
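Once you've fixed whatever that reason points to, you can return the node to
service with something like
scontrol update NodeName=wn001 State=RESUME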
Sean
--
Sean Crosby | S
It looks like your attachment of sinfo -R didn't come through
It also looks like your dbd isn't set up correctly
Can you also show the output of
sacctmgr list cluster
and
scontrol show config | grep ClusterName
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lea
The other thing I notice for my slurmdbd.conf is that I have
DbdAddr=localhost
DbdHost=localhost
You can try changing your slurmdbd.conf to set those 2 values as well to
see if that gets slurmdbd to listen on port 6819
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research
Interesting. It looks like slurmdbd is not opening the 6819 port
What does
ss -lntp | grep 6819
show? Is something else using that port?
You can also stop the slurmdbd service and run it in debug mode using
slurmdbd -D -vvv
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
What's the output of
ss -lntp | grep $(pidof slurmdbd)
on your dbd host?
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Tue, 6 Apr 2021 at 05:00, wrote:
&g
try connecting to port 6819 on the host 10.0.0.100, and output
nothing if the connection works, and would output Connection not working
otherwise
I would also test this on the DBD server itself
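For reference, the kind of one-liner I mean, assuming nc is installed on the
box you test from:
nc -z 10.0.0.100 6819 || echo Connection not working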
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business
out the lines
AccountingStorageUser=slurm
AccountingStoragePass=/run/munge/munge.socket.2
You shouldn't need those lines
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
O
> David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel
> dw...@drexel.edu 215.571.4335 (o)
> For URCF support: urcf-supp...@drexel.edu
> https://proteusmaster.urcf.drexel.edu/urcfwiki
> github:prehensilecode
>
>
> --
> *From:* slurm
What are your Slurm settings - what are the values of
ProctrackType
JobAcctGatherType
JobAcctGatherParams
and what's the contents of cgroup.conf? Also, what version of Slurm are you
using?
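A quick way to gather those in one go (cgroup.conf sits alongside slurm.conf;
/etc/slurm/cgroup.conf is assumed here, adjust if your config dir differs):
scontrol show config | grep -E 'ProctrackType|JobAcctGatherType|JobAcctGatherParams'
cat /etc/slurm/cgroup.conf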
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services
On Sat, 13 Mar 2021 at 08:48, Prentice Bisbal wrote:
> --
>
> It sounds like you're confusing job steps and tasks. For an MPI program,
> tasks and MPI ranks are the same thin
r QoS, set the OverPartQOS flag, and get the
users to specify that QoS.
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Tue, 2 Mar 2021 at 08:24, Stack Korora wrote:
&g
~]# cat
/sys/fs/cgroup/cpuset/slurm/uid_11470/job_24115684/cpuset.cpus
58
I will keep searching. I know we capture the real CPU ID as well, using
daemons running on the worker nodes, and we feed that into Ganglia.
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing
Licenses=(null) Network=(null)
Note the CPU_IDs and GPU IDX in the output
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Fri, 5 Feb 2021 at 02:01, Thomas Zeiser
It shows that for this node, it has 72 cores and 1.5TB RAM (the CfgTRES
part), and currently jobs are using 72 cores, and 442GB RAM.
I would run the same command on 4 or 5 of the nodes on your cluster, and
we'll have a better idea about what's going on.
Sean
--
Sean Crosby | Senior Dev
MSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=yes
TaskAffinity=no
CgroupMountpoint=/sys/fs/cgroup
The ConstrainDevices=yes is the key to stopping jobs from having access to
GPUs they didn't request.
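An easy sanity check: inside a job that requested a single GPU, run
nvidia-smi (e.g. srun --gres=gpu:1 nvidia-smi); with ConstrainDevices=yes it
should only list that one GPU.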
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Com
contact the
new compute node on SlurmdPort.
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Wed, 16 Dec 2020 at 03:48, Olaf Gellert wrote:
Hi Loris,
We have a completely separate test system, complete with a few worker
nodes, separate slurmctld/slurmdbd, so we can test Slurm upgrades etc.
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne
NO_VAL) then
slurm.user_msg("--gpus-per-task option requires --tasks specification")
return ESLURM_BAD_TASK_COUNT
end
end
end
end
end
end
end
Let me know if you improve it
nodes, try communicating with
the other slurmd daemons
e.g. from SRVGRIDSLURM01 do
nc -z SRVGRIDSLURM02 6818 || echo Cannot communicate
nc -z srvgridslurm03 6818 || echo Cannot communicate
Replace 6818 with the port you get from the scontrol show config command
earlier
Sean
--
Sean Crosby | S
Make sure slurmd on the client is stopped, and then run it in verbose mode
in the foreground
e.g.
/usr/local/slurm/latest/sbin/slurmd -D -v
Then post the output
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of
Hi Lars,
Do the regular slurm commands work from the client?
e.g.
squeue
scontrol show part
If they don't, it would be a sign of communication problems.
Is there a software firewall running on the master/client?
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Res
fying timelimit
accurately) means that cores will go idle when there are jobs that could
use them. If you're happy with that, then all is fine.
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Vic
$?
1
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Wed, 8 Jul 2020 at 01:14, Jason Simms wrote:
Can you see if it is set? (e.g. using scontrol show job 337475 or sacct -j
337475 -o Timelimit)
Sean
>
> Thanks again
>
> On Tue, Jul 7, 2020 at 11:39 AM Sean Crosby
> wrote:
>
>> Hi,
>>
>> What you have described is how the backfill scheduler works. If a lower
y job from
starting in its original time.
In your example job list, can you also list the requested times for each
job? That will show if it is the backfill scheduler doing what it is
designed to do.
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services
You have to install the pam-devel package on the server you use to build
Slurm. You'll then need to run ./configure and then make.
Then you'll be able to make the files in the contrib/pam_slurm_adopt folder
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research
could have different QoS names for all the partitions across
all of your clusters, and set the limits on the QoS?
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Sat, 20 Jun
Do you have other limits set? The QoS is hierarchical, and especially
partition QoS can override other QoS.
What's the output of
sacctmgr show qos -p
and
scontrol show part
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Service
Hi Thomas,
That value should be
sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4
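You can check it has been applied with something like
sacctmgr show qos gpujobs format=Name,MaxTRESPU
(MaxTRESPU being the per-user TRES limit column).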
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Wed, 6 May 2020 at 04:53, Theis
Hi Lisa,
cons_tres is part of Slurm 19.05 and higher. As you are using Slurm 18.08,
it won't be there. The select plugin for 18.08 is cons_res.
Is there a reason why you're using an old Slurm?
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computin
Who owns the munge directory and key? Is it the right uid/gid? Is the munge
daemon running?
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Thu, 16 Apr 2020 at 04:57, Dean
What happens if you change
AccountingStorageHost=localhost
to
AccountingStorageHost=192.168.1.1
i.e. same IP address as your ctl, and restart the ctld
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Wed, 26 Feb 2020 at 20:52, Pär Lundö <par.lu...@foi.se> wrote:
Hi,
Thank you for your quick replies.
Please bear with m
What services did you restart after changing the slurm.conf? Did you do an
scontrol reconfigure?
Do you have any reservations? scontrol show res
Sean
On Tue, 17 Dec. 2019, 10:35 pm Mahmood Naderan <mahmood...@gmail.com> wrote:
>Your running job is requesting 6 CPUs per node (4 nodes, 6
Hi Mahmood,
Your running job is requesting 6 CPUs per node (4 nodes, 6 CPUs per node). That
means 6 CPUs are being used on node hpc.
Your queued job is requesting 5 CPUs per node (4 nodes, 5 CPUs per node). In
total, if it was running, that would require 11 CPUs on node hpc. But hpc only
has 1
Looking at the SLURM code, it looks like it is failing with a call to
getpwuid_r on the ctld
What do these show (on slurm-master):
getent passwd turing
getent passwd 1000
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Platform Services | Business Services
CoEPP Research
Hi David,
What does:
scontrol show node orange01
scontrol show node orange02
show? Just to see if there's a default node weight hanging around, and if your
weight changes have been picked up.
Sean
--
Sean Crosby
Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Ser
se_uid
session required pam_unix.so
Sean
--
Sean Crosby
Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services
Research Computing | CoEPP | School of Physics
University of Melbourne
On Wed, 17 Jul 2019 at 21:05, Andy Georges <andy.geor...@ugent.be> wrote:
scheduling individually. The default value is 1.
Add Weight=1000 to the serv1 line, and serv2 should be given the job first.
Sean
--
Sean Crosby
Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services
Research Computing | CoEPP | School of Physics
University of Melbourne
On Sun, 1
How did you compile SLURM? Did you add the contribs/pmi and/or contribs/pmi2
plugins to the install? Or did you use PMIx?
Sean
--
Sean Crosby
Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services
Research Computing | CoEPP | School of Physics
University of Melbourne
On Thu
Hi Andrés,
Did you recompile OpenMPI after updating to SLURM 19.05?
Sean
--
Sean Crosby
Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services
Research Computing | CoEPP | School of Physics
University of Melbourne
On Thu, 6 Jun 2019 at 20:11, Andrés Marín Díaz
mailto:ama
de has been revamped, and no longer relies on
libssh2 to function. However, support for --x11 alongside sbatch has
been removed, as the new forwarding code relies on the allocating
salloc or srun command to process the forwarding.
Sean
--
Sean Crosby
Senior DevOpsHPC Engineer and HPC Team Lead | Re
Hi Mahmood,
I've never tried using the native X11 of SLURM without being ssh'ed into the
submit node.
Can you try ssh'ing with X11 forwarding to rocks7 (i.e. ssh -X user@rocks7)
from a different machine, and then try your srun --x11 command?
Sean
--
Sean Crosby
Senior DevOpsH
Hi Mahmood,
Are you physically logged into rocks7? Or are you connecting via SSH? $DISPLAY
= :1 kind of means that you are physically logged into the machine
Sean
--
Sean Crosby
Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services
Research Computing | CoEPP | School of
Hi Mahmood,
To get native X11 working with SLURM, we had to add this config to sshd_config
on the login node (your rocks7 host)
X11UseLocalhost no
You'll then need to restart sshd
Sean
--
Sean Crosby
Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services
Research Comp
Hi Eric,
Look at partition QOS - https://slurm.schedmd.com/SLUG15/Partition_QOS.pdf
The QoS options are MaxJobsPerUser and MaxSubmitPerUser (and also PerAccount
versions)
Sean
--
Sean Crosby
Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services
Research Computing | CoEPP
Hi Alex,
What's the actual content of your gres.conf file? Seems to me that you have
a trailing comma after the location of the nvidia device
Our gres.conf has
NodeName=gpuhost[001-077] Name=gpu Type=p100 File=/dev/nvidia0
Cores=0,2,4,6,8,10,12,14,16,18,20,22
NodeName=gpuhost[001-077] Name=gpu T
Hi,
When a user requests all of the GPUs on a system, but less than the total
number of CPUs, the CPU bindings aren't ideal
[root@host ~]# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 mlx5_3 mlx5_1 mlx5_2 mlx5_0 CPU Affinity
GPU0 X PHB SYS SYS SYS PHB SYS PHB 0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16