We are pleased to announce the availability of Slurm version 23.11.4.

The 23.11.4 release includes a number of stability fixes and various other bug fixes. Some notable changes: VSZ is no longer reported when using cgroup/v2 (this is not provided by the kernel); a warning has been added when select/linear and topology/tree are used together, as this combination will not be supported in the next major release; and a backwards compatibility issue that caused jobs using --gpus to be rejected when submitted from 23.02 or 22.05 has been fixed.

Slurm can be downloaded from https://www.schedmd.com/downloads.php .

-Tim

* Changes in Slurm 23.11.4
==========================
 -- Fix a memory leak when updating partition nodes.
 -- Don't leave a partition around if it fails to create with scontrol.
 -- Fix segfault when creating partition with bad node list from scontrol.
 -- Fix preserving partition nodes on bad node list update from scontrol.
 -- Fix assertion in developer mode on a failed message unpack.
 -- Fix repeat POWER_DOWN requests making the nodes available for ping.
 -- Fix rebuilding job alias_list on restart when nodes are still powering up.
 -- Fix INVALID nodes running health check.
 -- Fix cloud/future nodes not setting addresses on invalid registration.
 -- scrun - Remove the requirement to set the SCRUN_WORKING_DIR environment
    variable. This was a regression in 23.11.
 -- Add warning for using select/linear with topology/tree.
    This combination will not be supported in the next major version.
 -- Fix health check program not being run after first pass of all nodes when
    using MaxNodeCount.
 -- sacct - Set process exit code to one for all errors.
 -- Add SlurmctldParameters=disable_triggers option.
 -- Fix issue running steps when the allocation requested an exclusive
    allocation along with shards.
 -- Fix cleaning up the sleep process and the cgroup of the extern step if
    slurm_spank_task_post_fork returns an error.
 -- slurm_completion - Add missing --gres-flags= options
    multiple-tasks-per-sharing and one-task-per-sharing.
 -- scrun - Avoid race condition that could cause outbound network
    communications to be incorrectly rejected with an incomplete packet error.
 -- scrun - Gracefully handle kernel giving invalid expected number of incoming
    bytes for a connection causing incoming packet corruption resulting in
    connection getting closed.
 -- srun - Return 1 when a step launch fails.
 -- scrun - Avoid race condition that could cause deadlock during shutdown.
 -- Fix scontrol listpids to work under dynamic node scenarios.
 -- Add --tres-bind to --help and --usage output.
 -- Add --gres-flags=allow-task-sharing to allow GPUs to still be accessible
    among all tasks when binding GPUs to specific tasks.
 -- Fix issue with CUDA_VISIBLE_DEVICES showing the same MIG device for all
    tasks when using MIGs with --tres-per-task or --gpus-per-task.
 -- slurmctld - Prevent a potential hang during shutdown/reconfigure if the
    association cache thread was previously shut down.
 -- scrun - Avoid race condition that could cause scrun to hang during
    shutdown when connections have pending events.
 -- scrun - Avoid excessive polling of connections during shutdown that could
    needlessly cause 100% CPU usage on a thread.
 -- sbcast - Use user identity from broadcast credential instead of looking it
    up locally on the node.
 -- scontrol - Remove "abort" option handling.
 -- Fix an error message referring to the wrong RPC.
 -- Fix memory leak on error when creating dynamic nodes.
 -- Fix a slurmctld segfault when a cloud/dynamic node changes hostname on
    registration.
 -- Prevent a slurmctld deadlock if the gpu plugin fails to load when
    creating a node.
 -- Change a slurmctld fatal() to an error() when attempting to create a
    dynamic node with a global autodetect set in gres.conf.
 -- Fix leaving node records on error when creating nodes with scontrol.
 -- scrun/sackd - Avoid race condition where shutdown could deadlock.
 -- Fix a regression in 23.02.5 that caused pam_slurm_adopt to fail when
    the user has multiple jobs on a node.
 -- Add GLOB_SILENCE flag that silences the error message which will display if
    an include directive attempts to use the "*" wildcard.
 -- Fix jobs getting rejected when submitting with --gpus option from older
    versions of job submission commands (23.02 and older).
 -- cgroup/v2 - Return 0 for VSZ. Kernel cgroups do not provide this metric.
 -- scrun - Avoid race condition where outbound RPCs could be corrupted.
 -- scrun - Avoid race condition that could cause a crash while compiled in
    debug mode.
 -- gpu/rsmi - Disable gpu usage statistics when not using ROCM 6.0.0+
 -- Fix stuck processes and incorrect environment when using --get-user-env.
 -- Avoid segfault in the slurmdbd when TrackWCKey=no but you are still using
    WCKeys.
 -- Fix ctld segfault with TopologyParam=RoutePart and no partition defined.
 -- slurmctld - Fix missing --deadline handling for jobs not evaluated by the
    schedulers (i.e. non-runnable, skipped for other reasons, etc.).
 -- Demote some eio related logs from error to verbose in user commands.  These
    are not generally actionable by the user and are easily generated by port
    scanning a machine running srun.
 -- Make sprio correctly print array tasks that have not yet been split out.
 -- topology/block - Restrict the number of last-level blocks in any allocation.
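
Two of the entries above change exit codes (sacct now exits with one for all
errors, and srun returns 1 when a step launch fails), so wrapper scripts can
treat a non-zero status as a failure. The following is only a minimal sketch of
such a wrapper; the helper name run_slurm_command is hypothetical and not part
of Slurm.

```python
import subprocess


def run_slurm_command(argv):
    """Run a command and return its stdout, raising if it exits non-zero.

    Hypothetical helper for illustration; not part of Slurm.
    """
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        # Per the 23.11.4 changes, sacct errors and srun step launch
        # failures are reported through a non-zero exit code.
        raise RuntimeError(
            f"{argv[0]} exited with {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout
```

A call might look like run_slurm_command(["sacct", "--jobs", "12345",
"--noheader"]), where the job ID 12345 is a placeholder.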

--
Tim McMullan
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support

