We are pleased to announce the availability of Slurm versions 24.11.2 and 24.05.6.

24.11.2 fixes a variety of minor to major bugs. Fixed regressions include loading non-default QOS on pending jobs from pre-24.11 state, pending jobs displaying QOS=(null) when not explicitly requesting a QOS, running jobs that requested multiple partitions potentially having an incorrect partition when slurmctld is restarted, and burst_buffer.lua failing if slurm.conf is in a non-standard location. This release also fixes a few crashes in slurmctld: crashing when a job that can preempt requests --test-only, crashing when the scheduler evaluates a job on nodes with suspended jobs, and crashing due to a long-standing bug causing a job record without job_resrcs.

24.05.6 fixes sattach with auth/slurm, a slurmrestd crash when using data_parser/v0.0.40, a slurmctld crash when using job suspension, a performance regression for RPCs with large amounts of data, and some other moderate severity bugs.

Downloads are available at https://www.schedmd.com/downloads.php .

--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support

* Changes in Slurm 24.11.2
==========================
 -- Fix segfault when submitting --test-only jobs that can preempt.
 -- Fix regression introduced in 23.11 that prevented the following
    flags from being added to a reservation on an update:
    DAILY, HOURLY, WEEKLY, WEEKDAY, and WEEKEND.
 -- Fix crash and issues when evaluating a job's suitability for running
    on nodes that already have suspended job(s).
 -- Slurmctld will ensure that healthy nodes are not reported as
    UnavailableNodes in job reason codes.
 -- Fix handling of jobs submitted to a current reservation with
    flags OVERLAP,FLEX or OVERLAP,ANY_NODES when it overlaps nodes with a
    future maintenance reservation. When a job submission had a time limit that
    overlapped with the future maintenance reservation, it was rejected. Now
    the job is accepted but stays pending with the reason "ReqNodeNotAvail,
    Reserved for maintenance".
 -- pam_slurm_adopt - avoid errors when explicitly setting
    some arguments to the default value.
 -- Fix QOS preemption with PreemptMode=SUSPEND.
 -- slurmdbd - When changing a user's name update lineage
    at the same time.
 -- Fix regression in 24.11 in which burst_buffer.lua does not
    inherit the SLURM_CONF environment variable from slurmctld and fails to run
    if slurm.conf is in a non-standard location.
 -- Fix memory leak in slurmctld if select/linear and the
    PreemptParameters=reclaim_licenses options are both set in slurm.conf.
    Regression in 24.11.1.
 -- Fix running jobs that requested multiple partitions potentially
    being set to the wrong partition on restart.
 -- switch/hpe_slingshot - Fix compatibility with newer cxi
    drivers, specifically when specifying disable_rdzv_get.
 -- Add ABORT_ON_FATAL environment variable to capture a backtrace
    from any fatal() message.
 -- Fix printing invalid address in rate limiting log statement.
 -- sched/backfill - Fix node state PLANNED not being cleared from
    fully allocated nodes during a backfill cycle.
 -- select/cons_tres - Fix future planning of jobs with bf_licenses.
 -- Prevent redundant "on_data returned rc: Rate limit exceeded,
    please retry momentarily" error message from being printed in
    slurmctld logs.
 -- Fix loading non-default QOS on pending jobs from pre-24.11 state.
 -- Fix pending jobs displaying QOS=(null) when not explicitly
    requesting a QOS.
 -- Fix segfault caused by a job record with no job_resrcs.
 -- Fix failing sacctmgr delete/modify/show account operations
    with where clauses.
 -- Fix regression in 24.11 in which Slurm daemons started catching
    and ignoring the SIGTSTP, SIGTTIN and SIGUSR1 signals, which they did
    not ignore before. This also left slurmctld unable to shut down
    after a SIGTSTP because slurmscriptd caught the signal and stopped
    while slurmctld ignored it. Unify and fix these situations and
    return to the previous behavior for these signals.
 -- Document that SIGQUIT is no longer ignored by slurmctld,
    slurmdbd, and slurmd in 24.11. As of 24.11.0rc1, SIGQUIT is identical to
    SIGINT and SIGTERM for these daemons, but this change was not documented.
 -- Fix not considering nodes marked for reboot without ASAP
    in the scheduler.
 -- Remove the boot^ state on unexpected node reboot after
    return to service.
 -- Do not allow new jobs to start on a node which is being rebooted
    with the flag nextstate=resume.
 -- Prevent lower priority job running after cancelling an ASAP reboot.
 -- Fix srun jobs starting on nextstate=resume rebooting nodes.
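
The burst_buffer.lua regression above comes down to whether a child process
is started with the SLURM_CONF environment variable. The following Python
sketch illustrates only that general mechanism; it is not how slurmctld
launches the script. The path and the helper name child_view are
hypothetical and chosen for illustration.

```python
import os
import subprocess
import sys

# Hypothetical path, used only for illustration.
EXAMPLE_CONF = "/opt/slurm/etc/slurm.conf"

# The child simply reports the SLURM_CONF value it was started with.
CHILD_CODE = "import os; print(os.environ.get('SLURM_CONF', '<unset>'))"


def child_view(pass_slurm_conf: bool) -> str:
    """Spawn a child process and return the SLURM_CONF value it sees."""
    # Start from the parent's environment, minus any existing SLURM_CONF.
    env = dict(os.environ)
    env.pop("SLURM_CONF", None)
    if pass_slurm_conf:
        # The child only sees the variable if the parent hands it over.
        env["SLURM_CONF"] = EXAMPLE_CONF
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(child_view(pass_slurm_conf=True))
    print(child_view(pass_slurm_conf=False))
```

A child started without the variable has no way to learn about a
non-standard configuration location, which matches the failure described
in the changelog entry.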


* Changes in Slurm 24.05.6
==========================
 -- data_parser/v0.0.40 - Prevent a segfault in the slurmrestd when
    dumping data with v0.0.40+complex data parser.
 -- Fix sattach when using auth/slurm.
 -- scrun - Add support for the '--all' argument to the kill subcommand.
 -- Fix performance regression while packing larger RPCs.
 -- Fix crash and issues when evaluating a job's suitability for running
    on nodes that already have suspended job(s).
 -- Fixed a job requeuing issue that merged job entries into the
    same SLUID when all nodes in a job failed simultaneously.
 -- switch/hpe_slingshot - Fix compatibility with newer cxi
    drivers, specifically when specifying disable_rdzv_get.
 -- Add ABORT_ON_FATAL environment variable to capture a backtrace
    from any fatal() message.
