Slurm versions 20.02.3 and 19.05.7 are now available, and include a
series of recent bug fixes, as well as a fix for a security issue with
the optional message aggregation feature.
SchedMD customers were informed on May 7th and provided a patch on
request; this process is documented in our security policy [1].
CVE-2020-12693:
A review of what was intended to be a minor cleanup patch uncovered an
underlying race condition for systems with Message Aggregation enabled.
This race condition could allow a user to launch a process as an
arbitrary user.
This is only an issue for systems with Message Aggregation enabled,
which we expect to be a small number of Slurm installations in practice.
Message Aggregation is off in Slurm by default, and is only enabled by
MsgAggregationParams=WindowMsgs=<msgs>, where <msgs> is greater than 1.
(Using Message Aggregation on your systems is not a recommended
configuration at this time, and we may retire this subsystem in a future
Slurm release in favor of other RPC aggregation techniques. However,
care must be taken before disabling it to avoid communication issues.)
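
To check whether a given configuration falls under the condition described
above (MsgAggregationParams set with WindowMsgs greater than 1), a minimal
sketch along the following lines may help. The function name
message_aggregation_enabled is hypothetical and not part of Slurm, and the
parsing is deliberately simplified: it assumes plain Key=Value text, strips
"#" comments, matches names case-insensitively, and does not follow Include
directives.

```python
import re


def message_aggregation_enabled(conf_text: str) -> bool:
    """Return True if the given slurm.conf text sets WindowMsgs > 1.

    Hypothetical helper, not part of Slurm. Simplified parsing only.
    """
    pattern = re.compile(r"(?i)\bMsgAggregationParams\s*=\s*(\S+)")
    for raw_line in conf_text.splitlines():
        # Drop anything after a comment marker.
        line = raw_line.split("#", 1)[0]
        match = pattern.search(line)
        if not match:
            continue
        # The value is a comma-separated list such as
        # "WindowMsgs=10,WindowTime=100".
        for item in match.group(1).split(","):
            key, _, value = item.partition("=")
            if key.strip().lower() == "windowmsgs" and value.strip().isdigit():
                if int(value) > 1:
                    return True
    return False
```

It can be applied to the contents of your slurm.conf (its location varies by
installation). The configuration in effect can also be inspected with
"scontrol show config".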
Downloads are available at https://www.schedmd.com/downloads.php .
Release notes follow below.
- Tim
[1] https://www.schedmd.com/security.php
--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support
* Changes in Slurm 20.02.3
==========================
-- Factor in ntasks-per-core=1 with cons_tres.
-- Fix formatting in error message in cons_tres.
-- Fix calling stat on a NULL variable.
-- Fix minor memory leak when using reservations with flags=first_cores.
-- Fix gpu bind issue when CPUs=Cores and ThreadsPerCore > 1 on a node.
-- Fix --mem-per-gpu for heterogeneous --gres requests.
-- Fix slurmctld load order in load_all_part_state().
-- Fix race condition not finding jobacct gather task cgroup entry.
-- Suppress error message when selecting nodes on disjoint topologies.
-- Improve performance of _pack_default_job_details() with large number of job
arguments.
-- Fix loading per-node req_mem for jobs from archives previous to 17.11.
-- Fix regression validating that --gpus-per-socket requires --sockets-per-node
for steps. Should only validate allocation requests.
-- error() instead of fatal() when parsing an invalid hostlist.
-- nss_slurm - fix potential deadlock in slurmstepd on overloaded systems.
-- cons_tres - fix --gres-flags=enforce-binding and related --cpus-per-gres.
-- cons_tres - Allocate lowest numbered cores when filtering cores with gres.
-- Fix getting system counts for named GRES/TRES.
-- MySQL - Fix for handling typed GRES for association rollups.
-- Fix step allocations when tasks_per_core > 1.
-- Fix allocating more GRES than requested when asking for multiple GRES types.
* Changes in Slurm 19.05.7
==========================
-- Fix handling of -m/--distribution options for across socket/2nd level by
task/affinity plugin.
-- Fix grp_node_bitmap error when slurmctld started before slurmdbd.
-- Fix compilation issues in GCC10.
-- Fix distributing job steps across idle nodes within a job.
-- Break infinite loop in cons_tres dealing with incorrect tasks per tres
request resulting in slurmctld hang.
-- priority/multifactor - gracefully handle NULL list of associations or array
of siblings when calculating FairTree fairshare.
-- Fix cons_tres --exclusive=user to allocate only requested number of CPUs.
-- Add MySQL deadlock detection and automatic retry mechanism.
-- Fix _verify_node_state memory requested as --mem-per-gpu DefMemPerGPU.
-- Factor in ntasks-per-core=1 with cons_tres.
-- Fix formatting in error message in cons_tres.
-- Fix gpu bind issue when CPUs=Cores and ThreadsPerCore > 1 on a node.
-- Fix --mem-per-gpu for heterogeneous --gres requests.
-- Fix slurmctld load order in load_all_part_state().
-- Fix getting system counts for named GRES/TRES.
-- MySQL - Fix for handling typed GRES for association rollups.
-- Fix step allocations when tasks_per_core > 1.