[slurm-users] Fwd: Limiting I/O speed in slurm jobs

2023-09-07 Thread Eugene Teoh
Hi guys,

I'm trying to figure out how to set a per-task/per-job limit on I/O speed
(IOPS, throughput, or ideally both; even better would be io.latency).
After reading through the documentation and forums, I have been unable to
find a setting for this. As a workaround, I have set the limit under the
slurmstepd cgroup (v2) scope (/sys/fs/cgroup/system.slice/slurmstepd.scope),
but this applies globally, not per task. I am aware that I could also do the
same thing with systemd.resource-control, but again, that would be global.
Does anyone know how to set per-task I/O resource limits with cgroups under
Slurm?
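
For reference, this is roughly what the global workaround looks like (the
device major:minor number 8:0 and the limit values are only placeholders;
cgroup v2 io.max takes one "MAJ:MIN key=value ..." line per device):

  echo "8:0 rbps=104857600 wbps=104857600 riops=1000 wiops=1000" \
    > /sys/fs/cgroup/system.slice/slurmstepd.scope/io.max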

Regards,
Eugene


Re: [slurm-users] Problem with cgroup plugin in Ubuntu22.04 and slurm 21.08.5

2023-09-07 Thread Angel de Vicente
Hello Cristobal,

Cristóbal Navarro  writes:

> Hello Angel and Community,

> I am facing a similar problem with a DGX A100 with DGX OS 6 (Based on
> Ubuntu 22.04 LTS) and Slurm 23.02.
> When I start the `slurmd` service, its status shows failed with the
> information below.
> As of today, what is the best solution to this problem? I am really
> not sure whether the DGX A100 might break if cgroups v1 is disabled.
> Any suggestions are welcome.

did you manage to find a solution to this without disabling cgroups v1?

In our case:

,----
| slurm 23.02.3
| Ubuntu 22.04.3 LTS
|
| # cat /proc/cmdline
| BOOT_IMAGE=/boot/vmlinuz-5.15.0-83-generic root=UUID=... ro quiet splash cgroup_no_v1=all vt.handoff=7
`----

disabling cgroups v1 has been working reliably, but it would be nice to
find a solution that doesn't require modifying the kernel parameters.
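
In case it is useful, the parameter itself can be added the usual Ubuntu way
(the exact GRUB_CMDLINE contents here are only an illustration):

  # /etc/default/grub
  GRUB_CMDLINE_LINUX_DEFAULT="quiet splash cgroup_no_v1=all"

  # regenerate grub.cfg and reboot
  update-grub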

Cheers,
-- 
Ángel de Vicente
 Research Software Engineer (Supercomputing and BigData)
 Tel.: +34 922-605-747
 Web.: http://research.iac.es/proyecto/polmag/

 GPG: 0x8BDC390B69033F52




Re: [slurm-users] Tracking efficiency of all jobs on the cluster (dashboard etc.)

2023-09-07 Thread Angel de Vicente
Hi Will,

Will Furnell - STFC UKRI  writes:

> That does sound like an interesting solution – yes please, would you be
> able to send me (or the list, if you're willing to share it) some more
> information?
>
> And thank you everyone else that has replied to my email – there’s
> definitely a few solutions I need to look into here!

we also use 'seff', but it gives reliable stats only for jobs that
finished properly (i.e. COMPLETED). In our case, we would need to
collect efficiency stats also for jobs that TIMEOUT and even those that
are CANCELLED.

Do you happen to know of some way to accomplish this?
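
For what it is worth, one could presumably pull the raw numbers from sacct
regardless of the final job state and compute a rough CPU efficiency as
TotalCPU / (Elapsed * AllocCPUS) once the times are converted to seconds.
A sketch (the start time and state list are just examples):

  sacct -a -X -S 2023-09-01 -s TIMEOUT,CANCELLED \
        --format=JobID,State,Elapsed,TotalCPU,AllocCPUS -P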

Many thanks,
-- 
Ángel de Vicente
 Research Software Engineer (Supercomputing and BigData)
 Tel.: +34 922-605-747
 Web.: http://research.iac.es/proyecto/polmag/

 GPG: 0x8BDC390B69033F52




[slurm-users] Slurm version 23.02.5 is now available

2023-09-07 Thread Tim McMullan

We are pleased to announce the availability of Slurm version 23.02.5.

The 23.02.5 release includes a number of stability fixes and some fixes 
for notable regressions.


The SLURM_NTASKS environment variable, which in 23.02.0 was not set when 
using --ntasks-per-node, has been changed back to its 22.05 behavior of 
being set. The method by which it is set, however, is different and should 
be more accurate in more situations.


The mpi/pmi2 plugin now respects the SrunPortRange option, which matches 
the behavior of the mpi/pmix plugin as of 23.02.0.
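
For sites that restrict the ports srun may use, that option lives in
slurm.conf, for example (the range here is only an illustration):

  SrunPortRange=60001-63000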


The --uid and --gid options for salloc and srun have been removed. These 
options have not worked correctly since the CVE-2022-29500 fix, in 
combination with some changes made in 23.02.0.


Slurm can be downloaded from https://www.schedmd.com/downloads.php .

-Tim

--
Tim McMullan
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support



* Changes in Slurm 23.02.5
==========================
 -- Add the JobId to debug() messages indicating when cpus_per_task/mem_per_cpu
or pn_min_cpus are being automatically adjusted.
 -- Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
a node features plugin is configured.
 -- Fix and prevent reoccurring reservations from overlapping.
 -- job_container/tmpfs - Avoid attempts to share BasePath between nodes.
 -- Change the log message warning for rate limited users from verbose to info.
 -- With CR_Cpu_Memory, fix node selection for jobs that request gres and
--mem-per-cpu.
 -- Fix a regression from 22.05.7 in which some jobs were allocated too few
nodes, thus overcommitting cpus to some tasks.
 -- Fix a job being stuck in the completing state if the job ends while the
primary controller is down or unresponsive and the backup controller has
not yet taken over.
 -- Fix slurmctld segfault when a node registers with a configured CpuSpecList
while slurmctld configuration has the node without CpuSpecList.
 -- Fix cloud nodes getting stuck in POWERED_DOWN+NO_RESPOND state after not
registering by ResumeTimeout.
 -- slurmstepd - Avoid cleanup of config.json-less containers spooldir getting
skipped.
 -- slurmstepd - Cleanup per task generated environment for containers in
spooldir.
 -- Fix scontrol segfault when 'completing' command requested repeatedly in
interactive mode.
 -- Properly handle a race condition between bind() and listen() calls in the
network stack when running with SrunPortRange set.
 -- Federation - Fix revoked jobs being returned regardless of the -a/--all
option for privileged users.
 -- Federation - Fix canceling pending federated jobs from non-origin clusters
which could leave federated jobs orphaned from the origin cluster.
 -- Fix sinfo segfault when printing multiple clusters with --noheader option.
 -- Federation - fix clusters not syncing if clusters are added to a federation
before they have registered with the dbd.
 -- Change pmi2 plugin to honor the SrunPortRange option. This matches the new
    behavior of the pmix plugin in 23.02.0. Note that neither of these plugins
    makes use of the "MpiParams=ports=" option; previously they were limited
    only by the system's ephemeral port range.
 -- node_features/helpers - Fix node selection for jobs requesting changeable
features with the '|' operator, which could prevent jobs from running on
some valid nodes.
 -- node_features/helpers - Fix inconsistent handling of '&' and '|', where an
AND'd feature was sometimes AND'd to all sets of features instead of just
the current set. E.g. "foo|bar&baz" was interpreted as {foo,baz} or
{bar,baz} instead of how it is documented: "{foo} or {bar,baz}".
 -- Fix job accounting so that when a job is requeued its allocated node count
is cleared. After the requeue, sacct will correctly show that the job has
0 AllocNodes while it is pending or if it is canceled before restarting.
 -- sacct - AllocCPUS now correctly shows 0 if a job has not yet received an
allocation or if the job was canceled before getting one.
 -- Fix intel oneapi autodetect: detect the /dev/dri/renderD[0-9]+ gpus, and do
not detect /dev/dri/card[0-9]+.
 -- Format batch, extern, interactive, and pending step ids into strings that
are human readable.
 -- Fix node selection for jobs that request --gpus and a number of tasks fewer
than gpus, which resulted in incorrectly rejecting these jobs.
 -- Remove MYSQL_OPT_RECONNECT completely.
 -- Fix cloud nodes in POWERING_UP state disappearing (getting set to FUTURE)
when an `scontrol reconfigure` happens.
 -- openapi/dbv0.0.39 - Avoid assert / segfault on missing coordinators list.
 -- slurmrestd - Correct memory leak while parsing OpenAPI specification
templates with server overrides.
 -- slurmrestd - Reduce memory usage when printing out job CPU frequency.
 -- Fix overwriting user node reason with system message.
 -- Remove --uid / --gid options from salloc and srun.

[slurm-users] Glusterfs hints for state database

2023-09-07 Thread Michael Gutteridge
We've settled on the idea of using a glusterfs file system for rolling out
an HA Slurm controller.  Over the last year we've averaged 88,000 job
submissions per day, though it's usually lower than that (10-20K).
Disk activity on the existing state database seems to be maxing out around
40-50 io/s with a peak disk usage under 700MB.

We're replacing that with two controller hosts (eventually configured as an
HA pair) and a DBD host.  I've spun up a 3 replica glusterfs mirror between
these hosts for the state database. The physical disks backing this storage
are all SSD.
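
For context, the volume was created along these lines (hostnames, brick
paths, and the mount point are placeholders, not our actual layout):

  gluster volume create slurm_state replica 3 \
      ctl1:/bricks/slurm ctl2:/bricks/slurm dbd1:/bricks/slurm
  gluster volume start slurm_state
  mount -t glusterfs ctl1:/slurm_state /var/spool/slurmctld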

Are there any hints, tips, or problems anyone has run into with Glusterfs
for the state database? Any recommended tunings?

Thanks much

 - Michael