[slurm-users] Slurm version 24.05.4 is now available (CVE-2024-48936)

2024-10-23 Thread Tim Wickberg via slurm-users
Slurm version 24.05.4 is now available and includes a fix for a recently 
discovered security issue with the new stepmgr subsystem.


SchedMD customers were informed on October 9th and provided a patch on
request; this process is documented in our security policy. [1]

A mistake in authentication handling in stepmgr could permit an attacker 
to execute processes under other users' jobs. This is limited to jobs 
explicitly running with --stepmgr, or on systems that have globally 
enabled stepmgr through "SlurmctldParameters=enable_stepmgr" in their 
configuration. CVE-2024-48936.
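
For sites that want to check quickly whether stepmgr is globally enabled, here is a minimal sketch in Python. It is not part of the release. The function name stepmgr_enabled is hypothetical, and the sketch assumes slurm.conf uses plain "Key=Value" lines with comma-separated SlurmctldParameters options. It does not follow Include directives, and it cannot detect individual jobs submitted with --stepmgr.

```python
def stepmgr_enabled(conf_text: str) -> bool:
    """Return True if a SlurmctldParameters line lists enable_stepmgr.

    Hypothetical helper: expects the text of slurm.conf as a string.
    """
    for raw in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        # slurm.conf parameter names are matched case-insensitively here.
        if not sep or key.strip().lower() != "slurmctldparameters":
            continue
        options = [opt.strip().lower() for opt in value.split(",")]
        if "enable_stepmgr" in options:
            return True
    return False
```

Pass it the contents of your slurm.conf (the path varies by site). A True result means the configuration falls in the globally enabled case described above.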


Downloads are available at https://www.schedmd.com/downloads.php .

Release notes follow below.

- Tim

[1] https://www.schedmd.com/security-policy/

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support


* Changes in Slurm 24.05.4
==========================
 -- Fix generic int sort functions.
 -- Fix user look up using possible unrealized uid in the dbd.
 -- Fix FreeBSD compile issue with tls/none plugin.
 -- slurmrestd - Fix regressions that allowed slurmrestd to be run as SlurmUser
when SlurmUser was not root.
 -- mpi/pmix - Fix race conditions with het jobs at step start/end which could
cause srun to hang.
 -- Fix not showing some SelectTypeParameters in scontrol show config.
 -- Avoid assert when dumping certain removed fields in JSON/YAML.
 -- Improve how shards are scheduled with affinity in mind.
 -- Fix MaxJobsAccruePU not being respected when MaxJobsAccruePA is set
in the same QOS.
 -- Prevent backfill from planning jobs that use overlapping resources for the
same time slot if the job's time limit is less than bf_resolution.
 -- Fix memory leak when requesting typed gres and --[cpus|mem]-per-gpu.
 -- Prevent backfill from breaking out due to "system state changed" every 30
seconds if reservations use REPLACE or REPLACE_DOWN flags.
 -- slurmrestd - Make sure that scheduler_unset parameter defaults to true even
when the following flags are also set: show_duplicates, skip_steps,
disable_truncate_usage_time, run_away_jobs, whole_hetjob,
disable_whole_hetjob, disable_wait_for_result, usage_time_as_submit_time,
show_batch_script, and/or show_job_environment. Additionally, always make
sure show_duplicates and disable_truncate_usage_time default to true when
the following flags are also set: scheduler_unset, scheduled_on_submit,
scheduled_by_main, scheduled_by_backfill, and/or job_started. This affects
the following endpoints:
  'GET /slurmdb/v0.0.40/jobs'
  'GET /slurmdb/v0.0.41/jobs'
 -- Ignore --json and --yaml options for scontrol show config to prevent mixing
output types.
 -- Fix not considering nodes in reservations with Maintenance or Overlap flags
when creating new reservations with nodecnt or when they replace down nodes.
 -- Fix suspending/resuming steps running under a 23.02 slurmstepd process.
 -- Fix options like sprio --me and squeue --me for users with a uid greater
than 2147483647.
 -- fatal() if BlockSizes=0. This value is invalid and would otherwise cause the
slurmctld to crash.
 -- sacctmgr - Fix issue where clearing out a preemption list using
preempt='' would cause the given qos to no longer be preempt-able until set
again.
 -- Fix stepmgr creating job steps concurrently.
 -- data_parser/v0.0.40 - Avoid dumping "Infinity" for NO_VAL tagged "number"
fields.
 -- data_parser/v0.0.41 - Avoid dumping "Infinity" for NO_VAL tagged "number"
fields.
 -- slurmctld - Fix a potential leak while updating a reservation.
 -- slurmctld - Fix state save with reservation flags when an update fails.
 -- Fix reservation update issues with parameters Accounts and Users, when
using +/- signs.
 -- slurmrestd - Don't dump warning on empty wckeys in:
  'GET /slurmdb/v0.0.40/config'
  'GET /slurmdb/v0.0.41/config'
 -- Fix slurmd possibly leaving zombie processes on start up in configless when
the initial attempt to fetch the config fails.
 -- Fix crash when trying to drain a non-existing node (possibly deleted
before).
 -- slurmctld - fix segfault when calculating limit decay for jobs with an
invalid association.
 -- Fix IPMI energy gathering with multiple sensors.
 -- data_parser/v0.0.39 - Remove xassert requiring errors and warnings to have a
source string.
 -- slurmrestd - Prevent potential segfault when there is an error parsing an
array field which could lead to a double xfree. This applies to several
endpoints in data_parser v0.0.39, v0.0.40 and v0.0.41.
 -- scancel - Fix a regression from 23.11.6 where using both the --ctld and
--sibling options would cancel the federated job on all clusters instead of
only the cluster(s) specified by --sibling.
 -- accounting_storage/mysql - Fix bug when removing an association
specified with an empty partition.
 -- Fix setting multiple partition state restore on a job correctly.
 -- Fix difference in behavior when s

[slurm-users] loss of "unchangeable" node features

2024-10-23 Thread Laura Hild via slurm-users
Has anyone else noticed, somewhere between versions 22.05.11 and 23.11.9, 
losing fixed Features defined for a node in slurm.conf, and instead now just 
having those controlled by a NodeFeaturesPlugin like node_features/knl_generic?


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Randomly draining nodes

2024-10-23 Thread Ole Holm Nielsen via slurm-users

Hi Chris,

Thanks for confirming that UnkillableStepTimeout can have larger values 
without issues.  Do you have some suggestions for values that would safely 
cover network filesystem delays?


Best regards,
Ole

On 10/24/24 07:51, Christopher Samuel via slurm-users wrote:
Some time ago it was recommended that UnkillableStepTimeout values above
127 (or 256?) should not be used, see
https://support.schedmd.com/show_bug.cgi?id=11103.  I don't know if this
restriction is still valid with recent versions of Slurm?


As I read it that last comment includes a commit message for the fix to 
that problem, and we happily use a much longer timeout than that without 
apparent issue.


https://support.schedmd.com/show_bug.cgi?id=11103#c30
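
For anyone who wants to see what their cluster currently uses before changing it, here is a minimal sketch in Python. The function name unkillable_step_timeout is hypothetical. It assumes you feed it the text printed by "scontrol show config", and that the relevant line looks like "UnkillableStepTimeout = <number> sec". It only reads the value and does not recommend one.

```python
import re
from typing import Optional


def unkillable_step_timeout(config_text: str) -> Optional[int]:
    """Return the UnkillableStepTimeout value in seconds, or None if absent.

    Hypothetical helper: expects captured `scontrol show config` output.
    """
    for line in config_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "UnkillableStepTimeout":
            # Take the leading integer and ignore a trailing unit such as "sec".
            match = re.match(r"\s*(\d+)", value)
            return int(match.group(1)) if match else None
    return None
```

One way to get the input is to capture the command output with subprocess.run(["scontrol", "show", "config"], capture_output=True, text=True) and pass its stdout to the function.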

