Re: [slurm-users] Tracking efficiency of all jobs on the cluster (dashboard etc.)

2023-07-27 Thread Will Furnell - STFC UKRI
Hi Magnus,

That does sound like an interesting solution - yes please, would you be able to 
send me (or the list, if you're willing to share it more widely) some more 
information?

And thank you to everyone else who has replied to my email - there are definitely a 
few solutions I need to look into here!

Thanks!

Will


Re: [slurm-users] Problem with cgroup plugin in Ubuntu22.04 and slurm 21.08.5

2023-07-27 Thread ralf . utermann

Am 26.07.23 um 11:38 schrieb Ralf Utermann:

Am 25.07.23 um 02:09 schrieb Cristóbal Navarro:

Hello Angel and Community,
I am facing a similar problem with a DGX A100 with DGX OS 6 (Based on Ubuntu 
22.04 LTS) and Slurm 23.02.
When I execute `slurmd` service, it status shows failed with the following 
information below.


Hello Cristobal,

we see similar problems, not on a DGX but on standard server nodes running
Ubuntu 22.04 (kernel 5.15.0-76-generic) and Slurm 23.02.3.

The first start of the slurmd service always fails, with lots of errors
in the slurmd.log like:
   error: cpu cgroup controller is not available.
   error: There's an issue initializing memory or cpu controller
After 90 seconds the slurmd service start times out and the unit is marked failed.

BUT: One process is still running:
   /usr/local/slurm/23.02.3/sbin/slurmstepd infinity

This looks like the process started to handle cgroup v2 as described in
   https://slurm.schedmd.com/cgroup_v2.html

When we keep this slurmstepd infinity running, and just start
the slurmd service a second time, everything comes up running.

So our current workaround is: we configure the slurmd service
with Restart=on-failure in the [Service] section.
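
(For reference, a minimal sketch of such a drop-in; the override path below is
the standard systemd location and the RestartSec value is only an illustrative
assumption, adjust for your installation:)

   # /etc/systemd/system/slurmd.service.d/override.conf
   [Service]
   Restart=on-failure
   RestartSec=5

   # apply with: systemctl daemon-reload && systemctl restart slurmd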


Are there real solutions to this initial timeout failure?

best regards, Ralf




As of today, what is the best solution to this problem? I am really not sure 
whether the DGX A100 might break if cgroups v1 is disabled.
Any suggestions are welcome.

➜  slurm-23.02.3 systemctl status slurmd.service
× slurmd.service - Slurm node daemon
      Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: enabled)
      Active: failed (Result: exit-code) since Mon 2023-07-24 19:07:03 -04; 7s ago
     Process: 3680019 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
    Main PID: 3680019 (code=exited, status=1/FAILURE)
         CPU: 40ms

jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug:  Log file re-opened
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_init
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_load
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_export_xml
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug:  CPUs:128 Boards:1 Sockets:2 CoresPerSocket:64 ThreadsPerCore:1
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: Node reconfigured socket/core boundaries SocketsPerBoard=8:2(hw) CoresPerSocket=16:64(hw)
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are: 2:freezer:/
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: 0::/init.scope
jul 24 19:07:03 nodeGPU01 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
jul 24 19:07:03 nodeGPU01 systemd[1]: slurmd.service: Failed with result 'exit-code'.
➜  slurm-23.02.3
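
(Before disabling cgroup v1, it may help to check what the node currently
mounts; these are standard checks, not DGX-specific:)

   stat -fc %T /sys/fs/cgroup   # prints "cgroup2fs" on a pure v2 (unified) hierarchy, "tmpfs" on v1/hybrid
   grep cgroup /proc/mounts     # hybrid setups also list v1 controller mounts here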



On Wed, May 3, 2023 at 6:32 PM Angel de Vicente <angel.de.vice...@iac.es> wrote:

    Hello,

    Angel de Vicente <angel.de.vice...@iac.es> writes:

 > ,----
 > | slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are:
 > | 5:freezer:/
 > | 3:cpuacct:/
 > `----

    in the end I learnt that, despite Ubuntu 22.04 reporting that it was using
    only cgroup v2, it was also using v1 and creating those mount points,
    and Slurm 23.02.01 was then complaining that it could not work with
    cgroups in hybrid mode.

    So, the "solution" (as long as you don't need v1 for some reason) was to
    add "cgroup_no_v1=all" to the kernel parameters and reboot: no more v1
    mount points, and Slurm was happy with that.
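
    (For reference, a minimal sketch of adding that parameter on a GRUB-based
    Ubuntu install; the file and variable names below are the usual defaults
    and may differ on your system:)

       # /etc/default/grub -- append to the existing value
       GRUB_CMDLINE_LINUX_DEFAULT="... cgroup_no_v1=all"

       # regenerate the GRUB config and reboot:
       sudo update-grub && sudo reboot

       # afterwards only the v2 hierarchy should be mounted:
       grep cgroup /proc/mounts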

    [in case somebody is interested in the future, I needed this so that I
    could limit the resources given to users not using Slurm. We have some
    shared workstations with many cores and users were oversubscribing the
    CPUs, so I have installed Slurm to put some order in the executions
    there. But these machines are not an actual cluster with a login node:
    the login node is the same as the executing node! So with cgroups I make
    sure that users connecting via ssh only get resources equivalent to 3/4 of
    a core (enough to edit files, etc.) until they submit their jobs via Slurm,
    at which point they get the full allocation they requested].
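
    (A minimal sketch of the kind of per-user ssh limit described above, using
    a systemd drop-in for user slices; the path and the 75% figure are purely
    illustrative assumptions:)

       # /etc/systemd/system/user-.slice.d/75-cpu.conf
       [Slice]
       CPUQuota=75%

       # reload unit definitions; new user sessions pick up the limit:
       systemctl daemon-reload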

    Cheers,
    --     Ángel de Vicente
  Research Software Engineer (Supercomputing and BigData)
  Tel.: +34 922-605-747
  Web.: http://research.iac.es/proyecto/polmag/ 


  GPG: 0x8BDC390B69033F52



--
Cristóbal A. Navarro




--
Ralf Utermann

Universität Augsburg
Rechenzentrum
D-86135 Augsburg

ralf.uterm...@uni-a.de
https://www.rz.uni-augsburg.de




[slurm-users] Flag OVERLAP in advanced reservation

2023-07-27 Thread Gizo Nanava
Hello,

I observe strange behavior of advanced reservations that have OVERLAP in their 
flags list.

If I create two advanced reservations on different sets of nodes, and a 
particular username is configured to have access only to the one with the 
OVERLAP flag, then that username can also run jobs on nodes in the other 
reservation, which is reserved for other users.

In the example below, user1 should not have the right to start jobs 
on enos-n014:
 
> scontrol show reservations phd
ReservationName=phd StartTime=2023-07-23T08:00:00 EndTime=2023-07-23T20:00:00 
Duration=12:00:00
   Nodes=phd-n[001-032] NodeCnt=32 CoreCnt=2048 Features=(null) 
PartitionName=phd Flags=FLEX,OVERLAP,SPEC_NODES,PART_NODES,MAGNETIC 

   TRES=cpu=2048
   Users=user1 Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE 
BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

> scontrol show reservations test
ReservationName=test StartTime=2023-07-23T08:00:00 EndTime=2023-07-23T20:00:00 
Duration=12:00:00
   Nodes=enos-n014 NodeCnt=1 CoreCnt=32 Features=(null) PartitionName=(null) 
Flags=FLEX,SPEC_NODES,MAGNETIC
   TRES=cpu=16
   Users=user2 Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE 
BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

user1@login01:~$ salloc --nodelist=enos-n014
..
salloc: Pending job allocation 1816915
..
salloc: Nodes enos-n014 are ready for job
user1@enos-n014:~$

It is interesting that once the job starts on the "wrong" node (enos-n014 in 
the example above), its Reservation field is set to phd.

> scontrol show job 1816915 | grep -i reser
Reservation=phd

If I remove OVERLAP (or user1) from the phd reservation, then user1 cannot run 
jobs on enos-n014.
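
(For reference, this is the kind of update meant above; the "-=" shortcut for
removing individual users or flags is what the scontrol man page describes,
but it is not supported for every flag, so please verify it on 22.05:)

> scontrol update ReservationName=phd Flags-=OVERLAP
> scontrol update ReservationName=phd Users-=user1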

We run Slurm 22.05.9.

Any suggestions? I would appreciate any help.

Thank you.
Gizo



[slurm-users] Slurm version 23.02.4 is now available

2023-07-27 Thread Tim McMullan

We are pleased to announce the availability of Slurm version 23.02.4.

The 23.02.4 release includes a number of stability and bug fixes. Notable 
fixes include the main scheduler loop not starting on the backup controller 
after a failover event, a segfault when attempting to use 
AccountingStorageExternalHost, and an issue where steps could continue 
running indefinitely if the slurmctld takes too long to respond.


Slurm can be downloaded from https://www.schedmd.com/downloads.php .

-Tim

--
Tim McMullan
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support



* Changes in Slurm 23.02.4
==========================
 -- Fix sbatch return code when --wait is requested on a job array.
 -- switch/hpe_slingshot - avoid segfault when running with old libcxi.
 -- Avoid slurmctld segfault when specifying AccountingStorageExternalHost.
 -- Fix collected GPUUtilization values for acct_gather_profile plugins.
 -- Fix slurmrestd handling of job hold/release operations.
 -- Make spank S_JOB_ARGV item value hold the requested command argv instead of
the srun --bcast value when --bcast requested (only in local context).
 -- Fix step running indefinitely when slurmctld takes more than MessageTimeout
to respond. Now, slurmctld will cancel the step when detected, preventing
following steps from getting stuck waiting for resources to be released.
 -- Fix regression to make job_desc.min_cpus accurate again in job_submit when
requesting a job with --ntasks-per-node.
 -- scontrol - Permit changes to StdErr and StdIn for pending jobs.
 -- scontrol - Reset std{err,in,out} when set to empty string.
 -- slurmrestd - mark environment as a required field for job submission
descriptions.
 -- slurmrestd - avoid dumping null in OpenAPI schema required fields.
 -- data_parser/v0.0.39 - avoid rejecting valid memory_per_node formatted as
dictionary provided with a job description.
 -- data_parser/v0.0.39 - avoid rejecting valid memory_per_cpu formatted as
dictionary provided with a job description.
 -- slurmrestd - Return HTTP error code 404 when job query fails.
 -- slurmrestd - Add return schema to error response to job and license query.
 -- Fix handling of ArrayTaskThrottle in backfill.
 -- Fix regression in 23.02.2 when checking gres state on slurmctld startup or
reconfigure. Gres changes in the configuration were not updated on slurmctld
startup. On startup or reconfigure, these messages were present in the log:
"error: Attempt to change gres/gpu Count".
 -- Fix potential double count of gres when dealing with limits.
 -- switch/hpe_slingshot - support alternate traffic class names with "TC_"
prefix.
 -- scrontab - Fix cutting off the final character of quoted variables.
 -- Fix slurmstepd segfault when ContainerPath is not set in oci.conf
 -- Change the log message warning for rate limited users from debug to verbose.
 -- Fixed an issue where jobs requesting licenses were incorrectly rejected.
 -- smail - Fix issues where e-mails at job completion were not being sent.
 -- scontrol/slurmctld - fix comma parsing when updating a reservation's nodes.
 -- cgroup/v2 - Avoid capturing log output for ebpf when constraining devices,
as this can lead to inadvertent failure if the log buffer is too small.
 -- Fix --gpu-bind=single binding tasks to wrong gpus, leading to some gpus
having more tasks than they should and other gpus being unused.
 -- Fix main scheduler loop not starting after failover to backup controller.
 -- Added error message when attempting to use sattach on batch or extern steps.
 -- Fix regression in 23.02 that causes slurmstepd to crash when srun requests
more than TreeWidth nodes in a step and uses the pmi2 or pmix plugin.
 -- Reject job ArrayTaskThrottle update requests from unprivileged users.
 -- data_parser/v0.0.39 - populate description fields of property objects in
generated OpenAPI specifications where defined.
 -- slurmstepd - Avoid segfault caused by ContainerPath not being terminated by
'/' in oci.conf.
 -- data_parser/v0.0.39 - Change v0.0.39_job_info response to tag exit_code
field as being complex instead of only an unsigned integer.
 -- job_container/tmpfs - Fix %h and %n substitution in BasePath where %h was
substituted as the NodeName instead of the hostname, and %n was substituted
as an empty string.
 -- Fix regression where --cpu-bind=verbose would override TaskPluginParam.
 -- scancel - Fix --clusters/-M for federations. Only filtered jobs (e.g. -A,
-u, -p, etc.) from the specified clusters will be canceled, rather than all
jobs in the federation. Specific jobids will still be routed to the origin
cluster for cancellation.