[slurm-users] slurmctld keeps segfaulting, possibly during or just after backfill
We are running into a problem where slurmctld is segfaulting a few times a day. We had this problem with SLURM 23.11.8 and now with 23.11.10 as well, though the problem only appears on one of the several SLURM clusters we have, and all of them use one of those versions of SLURM. I was wondering if anyone has encountered a similar issue and has any thoughts on how to prevent this.

Obviously we use "SchedulerType=sched/backfill", but strangely, when I switched to sched/builtin for a while there were still slurmctld segfaults. We also set "SchedulerParameters=enable_user_top,bf_max_job_test=2000". I have tried turning those off, but it did not help. I have also tried tweaking several other settings, to no avail.

Most of the cluster runs Rocky Linux 8.10 (including the slurmctld system), though we still have some Scientific Linux 7.9 compute nodes (we compile SLURM separately for those).

Here is the crash-time error from journalctl:

Oct 02 06:31:20 our.host.name kernel: sched_agent[2048355]: segfault at 8 ip 7fec755d7ea8 sp 7fec6bffe7e8 error 4 in libslurmfull.so[7fec7555a000+1f4000]
Oct 02 06:31:20 our.host.name kernel: Code: 48 39 c1 7e 19 48 c1 f8 06 ba 01 00 00 00 48 d3 e2 48 f7 da 48 0b 54 c6 10 48 21 54 c7 10 c3 b8 00 00 00 00 eb da 48 8b 4f 08 <48> 39 4e 08 48 0f 4e 4e 08 49 89 c9 48 83 f9 3f 76 4e ba 40 00 00
Oct 02 06:31:20 our.host.name systemd[1]: Started Process Core Dump (PID 2169426/UID 0).
Oct 02 06:31:20 our.host.name systemd-coredump[2169427]: Process 2048344 (slurmctld) of user 991 dumped core.

This is followed by a list of each of the dozen or so related threads. The one which is dumping core is first and looks like this:

Stack trace of thread 2048355:
#0  0x7fec755d7ea8 bit_and_not (libslurmfull.so)
#1  0x0044531f _job_alloc (slurmctld)
#2  0x0044576b _job_alloc_whole_node_internal (slurmctld)
#3  0x00446e6d gres_ctld_job_alloc_whole_node (slurmctld)
#4  0x7fec722e29b8 job_res_add_job (select_cons_tres.so)
#5  0x7fec722f7c32 select_p_select_nodeinfo_set (select_cons_tres.so)
#6  0x7fec756e7dc7 select_g_select_nodeinfo_set (libslurmfull.so)
#7  0x00496eb3 select_nodes (slurmctld)
#8  0x00480826 _schedule (slurmctld)
#9  0x7fec753421ca start_thread (libpthread.so.0)
#10 0x7fec745f78d3 __clone (libc.so.6)

I have run slurmctld with "debug5" level logging, and it appears that the error occurs right after backfill considers a large number of jobs. Slurmctld could be failing at the end of backfill, or when doing something which happens just after backfill runs. Usually this is the last message before the crash:

[2024-09-25T18:39:42.076] slurmscriptd: debug: _slurmscriptd_mainloop: finished

If anyone has any thoughts or advice on this, it would be appreciated. Thank you.

--
Marcus Lauer
Systems Administrator
CETS Group, Research Support

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
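A fuller, source-level backtrace can usually be recovered from the systemd core dump with coredumpctl and gdb; a minimal sketch, assuming debug symbols are available for slurmctld and libslurmfull.so, and using the PID reported by systemd-coredump above:

  # list the captured slurmctld dumps, then open the matching one under gdb
  coredumpctl list slurmctld
  coredumpctl gdb 2048344

  # inside gdb, dump every thread with its local variables
  (gdb) set pagination off
  (gdb) thread apply all bt full

With symbols available, frames #0 and #1 may show which bitmap argument _job_alloc() handed to bit_and_not() when it faulted.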
[slurm-users] GPU Accounting
We have a node with 8 H100 GPUs that are split into MIG instances. We are using cgroups, and this seems to work fine. Users can do something like

sbatch --gres="gpu:1g.10gb:1" ...

and the job starts on the node with the GPUs; CUDA_VISIBLE_DEVICES and the PyTorch debug output show that the cgroup only gives them the GPU they asked for.

In the accounting database, jobs in the job table always have the "gres_used" column empty. I would expect to see "gpu:1g.10gb:1" there for the job above. I have this set in slurm.conf:

AccountingStorageTRES=gres/gpu

How can I see what GRES was requested with the job? At the moment I only see something like this in AllocTRES:

billing=1,cpu=1,gres/gpu=1,mem=8G,node=1

and I can't see any way to tell which specific MIG GPU was asked for.

This is related to the email from Richard Lefebvre dated 7th June 2023 entitled "Billing/accounting for MIGs is not working". As far as I can see, it got no replies. We are running slurm version 23.11.6.

Regards,

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
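For reference, sacct can report both the requested and the allocated TRES for a job, and the slurm.conf man page documents type-specific GRES entries for AccountingStorageTRES (gres/gpu:<type>). A sketch only; <jobid> is a placeholder, and whether a MIG profile name is accepted as the type in 23.11 is untested here:

  # show requested vs. allocated TRES for a completed job
  sacct -j <jobid> --format=JobID,ReqTRES%60,AllocTRES%60

  # slurm.conf (sketch): also track the type-specific GPU TRES
  AccountingStorageTRES=gres/gpu,gres/gpu:1g.10gb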