We are pleased to announce the availability of Slurm versions 18.08.1
and 17.11.10.
These releases include an extensive set of fixes: 18.08.1 covers changes
made since 18.08.0 was released at the end of August, and 17.11.10 covers
changes made since 17.11.9 was released at the start of August.
Please note that the 17.11.10 release is expected to be the last
maintenance release of that series (barring any critical security
issues) as our support team has shifted their attention to the 18.08
release. Also note that support for 17.02 ended in August; SchedMD
customers are encouraged to upgrade to a supported major release (18.08
or 17.11) at their earliest convenience.
Slurm can be downloaded from https://www.schedmd.com/downloads.php .
- Tim
--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support
* Changes in Slurm 18.08.1
==========================
-- Remove commented-out parts of man pages related to cons_tres work in 19.05,
as these were showing up on the web version due to a syntax error.
-- Prevent slurmctld performance issues in main background loop if multiple
backup controllers are unavailable.
-- Add missing user read association lock in burst_buffer/cray during init().
-- Fix incorrect spacing for PartitionName lines in 'scontrol write config'.
-- Fix creation of the step hwloc xml file so it occurs after the cpuset
cgroup has been created.
-- Add userspace as a valid default governor.
-- Add timers to group_cache_lookup so if going slow advise
LaunchParameters=send_gids.
-- Fix SLURM_STEP_GRES=none to work correctly.
-- Fix potential memory leak when a failure happens unpacking a ctld_multi_msg.
-- Fix potential double free when a failure happens unpacking a
node_registration_status_msg.
-- Fix sacctmgr show runaways.
-- Removed non-POSIX append operator from configure script for non-bash
support.
-- Fix sacct to not print huge reserve times when the job was never eligible.
-- burst_buffer/cray - Add missing locks around assoc_mgr when timing out a
burst buffer.
-- burst_buffer/cray - Update burst buffers when an association or qos
is removed from the system.
-- Remove documentation for deprecated Cray/ALPS systems. Please switch to
Native Cray mode instead.
-- Completely copy features when copying the list in the slurmctld.
-- PMIX - Fix issue with packing processes when using an arbitrary task
distribution.
-- Fix hostlists to be able to handle nodenames with '-' in them surrounded
by integers.
-- Added sort option to sprio output.
-- Fix correct job CPU count allocated.
-- Fix sacctmgr setting GrpJobs limit when setting GrpJobsAccrue limit.
-- Change the defaults to MemLimitEnforce=no and NoOverMemoryKill
(See RELEASE_NOTES).
-- Prevent abort when using Cray node features plugin on non-knl.
-- Add ability to reboot down nodes with scontrol reboot_nodes.
-- Protect against sending to the slurmdbd if the connection has gone away.
-- Fix invalid read when not using backup slurmctlds.
-- Prevent acct coordinators from changing default acct on add user.
-- Don't allow scontrol top to modify job priorities when priority == 1.
-- slurmsmwd - change parsing code to handle systems with the svid or inst
fields set in xtconsumer output.
-- Fix infinite loop in slurmctld if GRES is specified without a count.
-- sacct: Print error when unknown arguments are found.
-- Fix checking missing return codes when unpacking structures.
-- Fix slurm.spec-legacy including slurmsmwd.
-- More explicit error message when cgroup oom-kill events detected.
-- When updating an association and unable to find the parent association,
initialize the old fairshare association pointer correctly.
-- Wrap slurm_cond_signal() calls with mutexes where needed.
-- Fix correct timeout with resends in slurm_send_only_node_msg.
-- Fix pam_slurm_adopt to honor action_adopt_failure.
-- Have the slurmd recreate the hwloc xml file for the full system on restart.
-- sdiag - correct the units for the gettimeofday() stat to microseconds.
-- Set SLURM_CLUSTER_NAME environment variable in MailProg to the ClusterName.
-- smail - use SLURM_CLUSTER_NAME environment variable.
-- job_submit/lua - expose argc/argv options through lua interface.
-- slurmdbd - prevent false-positive warning about innodb settings having
been set too low if they're actually set over 2GB.
* Changes in Slurm 17.11.10
===========================
-- Move priority_sort_part_tier from slurmctld to libslurm to make it possible
to run the regression tests 24.* without changing that code since it links
directly to the priority plugin where that function isn't defined.
-- Fix issue where job time limits can increase to max walltime when updating
a job with scontrol.
-- Fix invalid protocol_version manipulation on big endian platforms causing
srun and sattach to fail.
-- Fix for QOS, Reservation and Alias env variables in srun.
-- mpi/pmi2 - Backport 6a702158b49c4 from 18.08 to avoid dangerous detached
thread.
-- When allowing heterogeneous steps make sure we copy all the options to
avoid copying strings that may be overwritten.
-- Print correctly when sh5util finds an empty file.
-- Fix sh5util to not seg fault on exit.
-- Fix sh5util to check correctly for H5free_memory.
-- Adjust OOM monitoring function in task/cgroup to prevent problems in
regression suite from leaked file descriptors.
-- Fix issue where a gres defined with a type and no count
(e.g. gres=gpu/tesla) would get a count of 0.
-- Allow sstat to talk to slurmd's that are new in protocol version.
-- Permit database names over 33 characters in accounting_storage/mysql.
-- Fix negative values when profiling.
-- Fix srun segfault caused by invalid memory reads on the env.
-- Fix segfault on job arrays when starting controller without dbd up.
-- Fix pmi2 to build with gcc 8.0+.
-- Fix proper alignment of clauses when determining if more nodes are needed
for an allocation.
-- Fix race condition when canceling a federation job that just started
running.
-- Prevent extra resources from being allocated when combining certain flags.
-- Fix problem in task/affinity plugin that can lead to slurmd fatal()'ing
when using --hint=nomultithread.
-- Fix left over socket file when step is ending and using pmi2 with
%n or %h in the spool dir.
-- Fix incorrect spacing for PartitionName lines in 'scontrol write config'.
-- Fix sacct to not print huge reserve times when the job was never eligible.
-- burst_buffer/cray - Add missing locks around assoc_mgr when timing out a
burst buffer.
-- burst_buffer/cray - Update burst buffers when an association or qos
is removed from the system.
-- If failed over to a backup controller, ensure the agent thread is launched
to handle deferred tasks.
-- Fix correct job CPU count allocated.
-- Protect against sending to the slurmdbd if the connection has gone away.
-- Fix checking missing return codes when unpacking structures.
-- Fix slurm.spec-legacy including slurmsmwd.
-- More explicit error message when cgroup oom-kill events detected.
-- When updating an association and unable to find the parent association,
initialize the old fairshare association pointer correctly.
-- Wrap slurm_cond_signal() calls with mutexes where needed.
-- Fix correct timeout with resends in slurm_send_only_node_msg.
-- Fix pam_slurm_adopt to honor action_adopt_failure.
-- job_submit/lua - expose argc/argv options through lua interface.
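
Both change lists above include "Wrap slurm_cond_signal() calls with mutexes
where needed." As a rough illustration of the general POSIX threads pattern
that entry refers to, the sketch below signals a condition variable while
holding the mutex that protects the predicate. It is not taken from the Slurm
source: it uses plain pthread calls rather than the slurm_* wrappers, and the
names lock, cond, work_ready, waiter, notify_work_ready and run_demo are all
hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical shared state; these names are illustrative, not from Slurm. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool work_ready = false;

/* Waiter: re-check the predicate in a loop while holding the mutex. */
static void *waiter(void *arg)
{
	(void) arg;
	pthread_mutex_lock(&lock);
	while (!work_ready)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	return NULL;
}

/*
 * Signaler: change the predicate and signal under the same mutex, so the
 * waiter cannot test the predicate and then block between the update and
 * the signal.
 */
static void notify_work_ready(void)
{
	pthread_mutex_lock(&lock);
	work_ready = true;
	pthread_cond_signal(&cond);
	pthread_mutex_unlock(&lock);
}

/* Start one waiter, notify it, and return 1 once it has woken up. */
int run_demo(void)
{
	pthread_t tid;
	int seen;

	if (pthread_create(&tid, NULL, waiter, NULL) != 0)
		return -1;
	notify_work_ready();
	if (pthread_join(tid, NULL) != 0)
		return -1;
	pthread_mutex_lock(&lock);
	seen = work_ready ? 1 : 0;
	pthread_mutex_unlock(&lock);
	return seen;
}
```

To try it, call run_demo() from a small main() and build with thread support
enabled (for example the -pthread option with gcc or clang). Because the
waiter checks work_ready in a loop, the demo finishes whether the signal is
sent before or after the waiter begins waiting.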