We are pleased to announce the availability of Slurm version 24.11.4.

This release fixes a variety of major to minor severity bugs. Some edge cases that caused jobs to pend forever are fixed. Notable stability issues that are fixed include:
* slurmctld crashing upon receiving a certain heterogeneous job submission.
* slurmd crashing after a communications failure with a slurmstepd.
* A variety of race conditions related to receiving and processing connections, including one that resulted in the slurmd ignoring new RPC connections.

Downloads are available at https://www.schedmd.com/downloads.php .

--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support

 -- slurmctld,slurmrestd - Avoid possible race condition that could have caused
    process to crash when listener socket was closed while accepting a new
    connection.
 -- slurmrestd - Avoid race condition that could have resulted in the address
    logged for a UNIX socket being incorrect.
 -- slurmrestd - Fix parameters in OpenAPI specification for the following
    endpoints to have "job_id" field:
    GET /slurm/v0.0.40/jobs/state/
    GET /slurm/v0.0.41/jobs/state/
    GET /slurm/v0.0.42/jobs/state/
    GET /slurm/v0.0.43/jobs/state/
 -- slurmd - Fix tracking of thread counts that could cause incoming
    connections to be ignored after burst of simultaneous incoming connections
    that trigger delayed response logic.
 -- Stepmgr - Avoid unnecessary SRUN_TIMEOUT forwarding to stepmgr.
 -- Fix jobs being scheduled on higher weighted powered down nodes.
 -- Fix how backfill scheduler filters nodes from the available nodes based on
    exclusive user and mcs_label requirements.
 -- acct_gather_energy/{gpu,ipmi} - Fix potential energy consumption adjustment
    calculation underflow.
 -- acct_gather_energy/ipmi - Fix regression introduced in 24.05.5 (which
    introduced the new way of preserving energy measurements through slurmd
    restarts) when EnergyIPMICalcAdjustment=yes.
 -- Prevent slurmctld deadlock in the assoc mgr.
 -- Fix memory leak when RestrictedCoresPerGPU is enabled.
 -- Fix preemptor jobs not entering execution due to wrong calculation of
    accounting policy limits.
 -- Fix certain job requests that were incorrectly denied with node
    configuration unavailable error.
 -- slurmd - Avoid crash when slurmd has a communications failure with
    slurmstepd.
 -- Fix memory leak when parsing yaml input.
 -- Prevent slurmctld from showing error message about PreemptMode=GANG being a
    cluster-wide option for `scontrol update part` calls that don't attempt to
    modify partition PreemptMode.
 -- Fix setting GANG preemption on partition when updating PreemptMode with
    scontrol.
 -- Fix CoreSpec and MemSpec limits not being removed from previously
    configured slurmd.
 -- Avoid race condition that could lead to a deadlock when slurmd, slurmstepd,
    slurmctld, slurmrestd or sackd have a fatal event.
 -- Fix jobs using --ntasks-per-node and --mem pending forever when the
    requested mem divided by the number of cpus surpasses the configured
    MaxMemPerCPU.
 -- slurmd - Fix address logged upon new incoming RPC connection from "INVALID"
    to IP address.
 -- Fix memory leak when retrieving reservations. This affects scontrol, sinfo,
    sview, and the following slurmrestd endpoints:
    'GET /slurm/{any_data_parser}/reservation/{reservation_name}'
    'GET /slurm/{any_data_parser}/reservations'
 -- Log warning instead of debugflags=conmgr gated log when deferring new
    incoming connections when the number of active connections exceeds
    conmgr_max_connections.
 -- Avoid race condition that could result in worker thread pool not activating
    all threads at once after a reconfigure resulting in lower utilization of
    available CPU threads until enough internal activity wakes up all threads
    in the worker pool.
 -- Avoid theoretical race condition that could result in new incoming RPC
    socket connections being ignored after reconfigure.
 -- slurmd - Avoid race condition that could result in a state where new
    incoming RPC connections will always be ignored.
 -- Add ReconfigFlags=KeepNodeStateFuture to restore saved FUTURE node state on
    restart and reconfig instead of reverting to FUTURE state. This will be
    made the default in 25.05.
 -- Fix case where hetjob submit would cause slurmctld to crash.
 -- Fix jobs using --cpus-per-gpu and --mem pending forever when the
    requested mem divided by the number of cpus surpasses the configured
    MaxMemPerCPU.
 -- Enforce that jobs using --mem and several --*-per-* options do not violate
    the MaxMemPerCPU in place.
 -- slurmctld - Fix use-cases of jobs incorrectly pending held when --prefer
    features are not initially satisfied.
 -- slurmctld - Fix jobs incorrectly held when --prefer not satisfied in some
    use-cases.
 -- Ensure RestrictedCoresPerGPU and CoreSpecCount don't overlap.
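
Several entries above concern jobs whose requested --mem, divided by the
number of cpus, surpasses the configured MaxMemPerCPU. The arithmetic can be
sketched roughly as below. This is only an illustration under stated
assumptions, not Slurm's actual implementation: the function names are
hypothetical, memory is assumed to be in megabytes, and the cpu count per node
is assumed to be already known (for example, one cpu per task with
--ntasks-per-node).

```python
# Hypothetical helpers illustrating the check described in the changelog.
# They are NOT part of Slurm; names and units (MB) are assumptions.


def mem_per_cpu_mb(mem_per_node_mb: int, cpus_per_node: int) -> float:
    """Return the per-cpu memory implied by a per-node --mem request."""
    if cpus_per_node <= 0:
        raise ValueError("cpus_per_node must be positive")
    return mem_per_node_mb / cpus_per_node


def exceeds_max_mem_per_cpu(
    mem_per_node_mb: int, cpus_per_node: int, max_mem_per_cpu_mb: int
) -> bool:
    """True when the implied per-cpu memory surpasses MaxMemPerCPU."""
    return mem_per_cpu_mb(mem_per_node_mb, cpus_per_node) > max_mem_per_cpu_mb


if __name__ == "__main__":
    # Assumed request: --ntasks-per-node=4 (one cpu per task), --mem=16384,
    # on a cluster with MaxMemPerCPU=2048.
    print(mem_per_cpu_mb(16384, 4))
    print(exceeds_max_mem_per_cpu(16384, 4, 2048))
```

With the assumed values, the implied per-cpu memory is twice the limit, which
is the kind of request the entries above describe. How Slurm resolves such a
request internally is not covered by this sketch.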

