We are pleased to announce the availability of Slurm version 23.02.4.
The 23.02.4 release includes a number of stability improvements and other bug fixes. Notable fixes address the main scheduler loop not starting on the backup controller after a failover event, a segfault when attempting to use AccountingStorageExternalHost, and an issue where steps could continue running indefinitely if slurmctld takes too long to respond.
Slurm can be downloaded from https://www.schedmd.com/downloads.php .

-Tim

--
Tim McMullan
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
* Changes in Slurm 23.02.4
==========================
 -- Fix sbatch return code when --wait is requested on a job array.
 -- switch/hpe_slingshot - avoid segfault when running with old libcxi.
 -- Avoid slurmctld segfault when specifying AccountingStorageExternalHost.
 -- Fix collected GPUUtilization values for acct_gather_profile plugins.
 -- Fix slurmrestd handling of job hold/release operations.
 -- Make spank S_JOB_ARGV item value hold the requested command argv instead of
    the srun --bcast value when --bcast requested (only in local context).
 -- Fix step running indefinitely when slurmctld takes more than MessageTimeout
    to respond. Now, slurmctld will cancel the step when detected, preventing
    following steps from getting stuck waiting for resources to be released.
 -- Fix regression to make job_desc.min_cpus accurate again in job_submit when
    requesting a job with --ntasks-per-node.
 -- scontrol - Permit changes to StdErr and StdIn for pending jobs.
 -- scontrol - Reset std{err,in,out} when set to empty string.
 -- slurmrestd - mark environment as a required field for job submission
    descriptions.
 -- slurmrestd - avoid dumping null in OpenAPI schema required fields.
 -- data_parser/v0.0.39 - avoid rejecting valid memory_per_node formatted as
    dictionary provided with a job description.
 -- data_parser/v0.0.39 - avoid rejecting valid memory_per_cpu formatted as
    dictionary provided with a job description.
 -- slurmrestd - Return HTTP error code 404 when job query fails.
 -- slurmrestd - Add return schema to error response to job and license query.
 -- Fix handling of ArrayTaskThrottle in backfill.
 -- Fix regression in 23.02.2 when checking gres state on slurmctld startup or
    reconfigure. Gres changes in the configuration were not updated on slurmctld
    startup. On startup or reconfigure, these messages were present in the log:
    "error: Attempt to change gres/gpu Count".
 -- Fix potential double count of gres when dealing with limits.
 -- switch/hpe_slingshot - support alternate traffic class names with "TC_"
    prefix.
 -- scrontab - Fix cutting off the final character of quoted variables.
 -- Fix slurmstepd segfault when ContainerPath is not set in oci.conf.
 -- Change the log message warning for rate limited users from debug to verbose.
 -- Fixed an issue where jobs requesting licenses were incorrectly rejected.
 -- smail - Fix issues where e-mails at job completion were not being sent.
 -- scontrol/slurmctld - fix comma parsing when updating a reservation's nodes.
 -- cgroup/v2 - Avoid capturing log output for ebpf when constraining devices,
    as this can lead to inadvertent failure if the log buffer is too small.
 -- Fix --gpu-bind=single binding tasks to wrong gpus, leading to some gpus
    having more tasks than they should and other gpus being unused.
 -- Fix main scheduler loop not starting after failover to backup controller.
 -- Added error message when attempting to use sattach on batch or extern steps.
 -- Fix regression in 23.02 that causes slurmstepd to crash when srun requests
    more than TreeWidth nodes in a step and uses the pmi2 or pmix plugin.
 -- Reject job ArrayTaskThrottle update requests from unprivileged users.
 -- data_parser/v0.0.39 - populate description fields of property objects in
    generated OpenAPI specifications where defined.
 -- slurmstepd - Avoid segfault caused by ContainerPath not being terminated by
    '/' in oci.conf.
 -- data_parser/v0.0.39 - Change v0.0.39_job_info response to tag exit_code
    field as being complex instead of only an unsigned integer.
 -- job_container/tmpfs - Fix %h and %n substitution in BasePath where %h was
    substituted as the NodeName instead of the hostname, and %n was substituted
    as an empty string.
 -- Fix regression where --cpu-bind=verbose would override TaskPluginParam.
 -- scancel - Fix --clusters/-M for federations. Only filtered jobs (e.g. -A,
    -u, -p, etc.) from the specified clusters will be canceled, rather than all
    jobs in the federation. Specific jobids will still be routed to the origin
    cluster for cancellation.