Slurm version 24.05.4 is now available and includes a fix for a recently
discovered security issue with the new stepmgr subsystem.
SchedMD customers were informed on October 9th and provided a patch on
request; this process is documented in our security policy. [1]
A mistake in authentication handling in stepmgr could permit an attacker
to execute processes under other users' jobs. This is limited to jobs
explicitly running with --stepmgr, or on systems that have globally
enabled stepmgr through "SlurmctldParameters=enable_stepmgr" in their
configuration. CVE-2024-48936.
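
As a rough way to see whether a cluster has the global setting described above, the sketch below scans the text of a slurm.conf for enable_stepmgr in SlurmctldParameters. The function name is hypothetical, and the parsing is an assumption-laden simplification: it does not follow Include directives, and it cannot detect individual jobs submitted with --stepmgr.

```python
import re

# Hypothetical helper, not part of Slurm: checks slurm.conf text for the
# global "SlurmctldParameters=enable_stepmgr" setting named in the advisory.
_PARAM_RE = re.compile(r"SlurmctldParameters\s*=\s*(.*)$", re.IGNORECASE)


def stepmgr_enabled_globally(slurm_conf_text: str) -> bool:
    """Return True if any SlurmctldParameters line lists enable_stepmgr."""
    for raw_line in slurm_conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        match = _PARAM_RE.match(line)
        if not match:
            continue
        # The value is a comma-separated list of options.
        options = [opt.strip().lower() for opt in match.group(1).split(",")]
        if "enable_stepmgr" in options:
            return True
    return False


if __name__ == "__main__":
    sample = "ClusterName=demo\nSlurmctldParameters=enable_configless,enable_stepmgr\n"
    print(stepmgr_enabled_globally(sample))
```

A negative result only rules out the global setting in the text that was scanned; jobs that request --stepmgr explicitly are still affected until the fix is applied.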
Downloads are available at https://www.schedmd.com/downloads.php .
Release notes follow below.
- Tim
[1] https://www.schedmd.com/security-policy/
--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support
* Changes in Slurm 24.05.4
==========================
-- Fix generic int sort functions.
-- Fix user look up using possible unrealized uid in the dbd.
-- Fix FreeBSD compile issue with tls/none plugin.
-- slurmrestd - Fix regressions that allowed slurmrestd to be run as SlurmUser
when SlurmUser was not root.
-- mpi/pmix - Fix race conditions with het jobs at step start/end which could
cause srun to hang.
-- Fix not showing some SelectTypeParameters in scontrol show config.
-- Avoid assert when dumping certain removed fields in JSON/YAML.
-- Improve how shards are scheduled with affinity in mind.
-- Fix MaxJobsAccruePU not being respected when MaxJobsAccruePA is set
in the same QOS.
-- Prevent backfill from planning jobs that use overlapping resources for the
same time slot if the job's time limit is less than bf_resolution.
-- Fix memory leak when requesting typed gres and --[cpus|mem]-per-gpu.
-- Prevent backfill from breaking out due to "system state changed" every 30
seconds if reservations use REPLACE or REPLACE_DOWN flags.
-- slurmrestd - Make sure that scheduler_unset parameter defaults to true even
when the following flags are also set: show_duplicates, skip_steps,
disable_truncate_usage_time, run_away_jobs, whole_hetjob,
disable_whole_hetjob, disable_wait_for_result, usage_time_as_submit_time,
show_batch_script, and/or show_job_environment. Additionally, always make
sure show_duplicates and disable_truncate_usage_time default to true when
the following flags are also set: scheduler_unset, scheduled_on_submit,
scheduled_by_main, scheduled_by_backfill, and/or job_started. This affects
the following endpoints:
'GET /slurmdb/v0.0.40/jobs'
'GET /slurmdb/v0.0.41/jobs'
-- Ignore --json and --yaml options for scontrol show config to prevent mixing
output types.
-- Fix not considering nodes in reservations with Maintenance or Overlap flags
when creating new reservations with nodecnt or when they replace down nodes.
-- Fix suspending/resuming steps running under a 23.02 slurmstepd process.
-- Fix options like sprio --me and squeue --me for users with a uid greater
than 2147483647.
-- fatal() if BlockSizes=0. This value is invalid and would otherwise cause the
slurmctld to crash.
-- sacctmgr - Fix issue where clearing out a preemption list using
preempt='' would cause the given qos to no longer be preempt-able until set
again.
-- Fix stepmgr creating job steps concurrently.
-- data_parser/v0.0.40 - Avoid dumping "Infinity" for NO_VAL tagged "number"
fields.
-- data_parser/v0.0.41 - Avoid dumping "Infinity" for NO_VAL tagged "number"
fields.
-- slurmctld - Fix a potential leak while updating a reservation.
-- slurmctld - Fix state save with reservation flags when an update fails.
-- Fix reservation update issues with parameters Accounts and Users, when
using +/- signs.
-- slurmrestd - Don't dump warning on empty wckeys in:
'GET /slurmdb/v0.0.40/config'
'GET /slurmdb/v0.0.41/config'
-- Fix slurmd possibly leaving zombie processes on start up in configless when
the initial attempt to fetch the config fails.
-- Fix crash when trying to drain a non-existing node (possibly deleted
before).
-- slurmctld - fix segfault when calculating limit decay for jobs with an
invalid association.
-- Fix IPMI energy gathering with multiple sensors.
-- data_parser/v0.0.39 - Remove xassert requiring errors and warnings to have a
source string.
-- slurmrestd - Prevent potential segfault when there is an error parsing an
array field which could lead to a double xfree. This applies to several
endpoints in data_parser v0.0.39, v0.0.40 and v0.0.41.
-- scancel - Fix a regression from 23.11.6 where using both the --ctld and
--sibling options would cancel the federated job on all clusters instead of
only the cluster(s) specified by --sibling.
-- accounting_storage/mysql - Fix bug when removing an association
specified with an empty partition.
-- Fix restoring state correctly for jobs submitted to multiple partitions.
-- Fix difference in behavior when swapping partition order in job submission.
-- Fix security issue in stepmgr that could permit an attacker to execute
processes under other users' jobs. CVE-2024-48936.
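
The --me fix listed above concerns uids greater than 2147483647, which is the largest value a signed 32-bit integer can hold. The changelog does not state the root cause, so the following is only an illustration of that boundary, not a description of the actual Slurm code: it shows how an unsigned uid changes value if it is ever passed through a signed 32-bit integer. The helper name is hypothetical.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2147483647, the threshold named in the changelog


def as_signed_32(uid: int) -> int:
    """Reinterpret a uid as a signed 32-bit integer (hypothetical helper)."""
    return ctypes.c_int32(uid).value


if __name__ == "__main__":
    for uid in (1000, INT32_MAX, INT32_MAX + 1, 3_000_000_000):
        signed = as_signed_32(uid)
        status = "preserved" if signed == uid else "corrupted"
        print(f"uid {uid} -> {signed} ({status})")
```

Values up to 2147483647 survive the round trip, while anything larger wraps to a negative number, so a comparison against the caller's real uid would no longer match.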