We are pleased to announce the availability of Slurm version 22.05.6.
This includes a fix to core selection for job steps that could result in random task launch failures, alongside a number of other moderate-severity fixes.
- Marshall

--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support
* Changes in Slurm 22.05.6
==========================
 -- Fix a partition's DisableRootJobs=no from preventing root jobs from working.
 -- Fix the number of allocated cpus for an auto-adjustment case in which the
    job requests --ntasks-per-node and --mem (per-node) but the limit is
    MaxMemPerCPU.
 -- Fix POWER_DOWN_FORCE request leaving node in completing state.
 -- Do not count magnetic reservation queue records towards backfill limits.
 -- Clarify error message when --send-libs=yes or BcastParameters=send_libs
    fails to identify shared library files, and avoid creating an empty
    "<filename>_libs" directory on the target filesystem.
 -- Fix missing CoreSpec on dynamic nodes upon slurmctld restart.
 -- Fix node state reporting when using specialized cores.
 -- Fix number of CPUs allocated if --cpus-per-gpu used.
 -- Add flag ignore_prefer_validation to not validate --prefer on a job.
 -- Fix salloc/sbatch SLURM_TASKS_PER_NODE output environment variable when
    the number of tasks is not requested.
 -- Permit using wildcard magic cookies with X11 forwarding.
 -- cgroup/v2 - Add check for swap when running OOM check after task
    termination.
 -- Fix deadlock caused by race condition when disabling power save with a
    reconfigure.
 -- Fix memory leak in the dbd when container is sent to the database.
 -- openapi/dbv0.0.38 - correct dbv0.0.38_tres_info.
 -- Fix node SuspendTime, SuspendTimeout, ResumeTimeout being updated after
    altering partition node lists with scontrol.
 -- jobcomp/elasticsearch - fix data_t memory leak after serialization.
 -- Fix issue where '*' wasn't accepted in gpu/cpu bind.
 -- Fix SLURM_GPUS_ON_NODE for shared GPU gres (MPS, shards).
 -- Add SLURM_SHARDS_ON_NODE environment variable for shards.
 -- Fix srun error with overcommit.
 -- Fix bug in core selection for the default cyclic distribution of tasks
    across sockets, which resulted in random task launch failures.
 -- Fix core selection for steps requesting multiple tasks per core when the
    allocation contains more cores than required for the step.
 -- gpu/nvml - Fix MIG minor number generation when GPU minor number
    (/dev/nvidia[minor_number]) and index (as seen in nvidia-smi) do not match.
 -- Fix accrue time underflow errors after slurmctld reconfig or restart.
 -- Suppress errant errors from prolog_complete about being unable to locate
    "node:(null)".
 -- Fix issue where shards were selected from multiple GPUs and failed to
    allocate.
 -- Fix step cpu count calculation when using --ntasks-per-gpu=.
 -- Fix overflow problems when validating array index parameters in slurmctld
    and prevent a potential condition causing slurmctld to crash.
 -- Remove dependency on json-c in slurmctld when running with power saving.
    Only the new "SLURM_RESUME_FILE" support relies on this, and it will be
    disabled if json-c support is unavailable instead.
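For sites verifying the environment-variable changes above (the new
SLURM_SHARDS_ON_NODE variable, and the fixes to SLURM_GPUS_ON_NODE and
SLURM_TASKS_PER_NODE), one quick check is to print the values from inside an
allocation after upgrading. The snippet below is only a minimal sketch, not
part of the release; it assumes a cluster with GPU or shard GRES configured
and is launched through your usual srun/sbatch workflow.

    # check_slurm_env.py - minimal sketch: print the Slurm-provided
    # environment variables mentioned in the 22.05.6 changelog. Run it
    # inside a job step (e.g. "srun python3 check_slurm_env.py").
    import os

    for var in ("SLURM_GPUS_ON_NODE",
                "SLURM_SHARDS_ON_NODE",
                "SLURM_TASKS_PER_NODE"):
        # Variables are only set when the corresponding resources/options
        # apply to the job, so "<unset>" is a normal result otherwise.
        print(f"{var}={os.environ.get(var, '<unset>')}")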