[slurm-dev] Slurm versions 16.05.3 and 17.02.0-pre1 now available

jette Tue, 26 Jul 2016 16:16:51 -0700


Slurm version 16.05.3 is now available and includes about 30 bug fixes
developed over the past few weeks. We have also released the first
pre-release of version 17.02, which is under development and scheduled for
release in February 2017. A description of the changes in each version is
appended.


Slurm downloads are available from:
http://www.schedmd.com/#repos

* Changes in Slurm 16.05.3
==========================
 -- Make it so the extern step uses a reverse tree when cleaning up.
 -- If extern step doesn't get added into the proctrack plugin make sure the
    sleep is killed.
 -- Fix areas the slurmctld can segfault if an extern step is in the system
    cleaning up on a restart.
 -- Prevent possible incorrect counting of GRES of a given type if a node has
    the multiple "types" of a given GRES "name", which could over-subscribe
    GRES of a given type.
 -- Add web links to Slurm Diamond Collectors (from Harvard University) and
    collectd (from EDF).
 -- Add job_submit plugin for the "reboot" field.
 -- Make some more Slurm constants (INFINITE, NO_VAL64, etc.) available to
    job_submit/lua plugins.
 -- Send in a -1 for a taskid into spank_task_post_fork for the extern_step.
 -- MYSQL - Slightly better logic if a job completion comes in with an end
    time of 0.
 -- If task/cgroup plugin is configured with ConstrainRAMSpace=yes, then set
    soft memory limit to allocated memory limit (previously no soft limit was
    set).
 -- Document limitations in burst buffer use by the salloc command (possible
    access problems from a login node).
 -- Fix proctrack plugin to only add the pid of a process once
    (regression in 16.05.2).
 -- Fix for sstat to print correct info when requesting jobid.batch as part of
    a comma-separated list.
 -- CRAY - Fix issue if pid has already been added to another job container.
 -- CRAY - Fix add of extern step to AELD.
 -- burst_buffer/cray: avoid batch submit error condition if waiting for
    stagein.
 -- CRAY - Fix for reporting steps lingering after they are already finished.
 -- Testsuite - fix test1.29 / 17.15 for limits with values above 32-bits.
 -- CRAY - Simplify when a NHC is called on a step that has unkillable
    processes.
 -- CRAY - If trying to kill a step and you have NHC_NO_STEPS set run NHC
    anyway to attempt to log the backtraces of the potential
    unkillable processes.
 -- Fix gang scheduling and license release logic if single node job killed on
    bad node.
 -- Make scontrol show steps show the extern step correctly.
 -- Do not schedule powered down nodes in FAILED state.
 -- Do not start slurmctld power_save thread until partition information is
    read in order to prevent race condition that can result in invalid pointer
    when trying to resolve configured SuspendExcParts.
 -- Add SLURM_PENDING_STEP id so it won't be confused with SLURM_EXTERN_CONT.
 -- Fix for core selection with job --gres-flags=enforce-binding option.
    Previous logic would in some cases allocate a job zero cores, resulting in
    slurmctld abort.
 -- Minimize preempted jobs for configurations with multiple jobs per node.
 -- Improve partition AllowGroups caching. Update the table of UIDs permitted
    to use a partition based upon its AllowGroups configuration parameter as
    new valid UIDs are found rather than looking up that user's group
    information for every job they submit. If the user is now allowed to use
    the partition, then do not check that user's group access again for 5
    seconds.
 -- Add routing queue information to Slurm FAQ web page.
 -- Do not select_g_step_finish() a SLURM_PENDING_STEP step, as nothing has
    been allocated for the step yet.
 -- Fixed race condition in PMIx Fence logic.
 -- Prevent slurmctld abort if job is killed or requeued while waiting for
    reboot of its allocated compute nodes.
 -- Treat invalid user ID in AllowUserBoot option of knl.conf file as error
    rather than fatal (log and do not exit).
 -- qsub - When doing the default output files for an array in qsub style
    make them using the master job ID instead of the normal job ID.
 -- Create the extern step while creating the job instead of waiting until the
    end of the job to do it.
 -- Always report a 0 exit code for the extern step instead of being canceled
    or failed based on the signal that would always be killing it.
 -- Fix to allow users to update QOS of pending jobs.
 -- Print correct cluster name in "slurmd -C" output.
 -- CRAY - Fix minor memory leak in switch plugin.
 -- CRAY - Change slurmconfgen_smw.py to skip over disabled nodes.
 -- Fix eligible_time for elasticsearch as well as add queue_wait
    (difference between start of job and when it was eligible).
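
The queue_wait value in the last entry above is described as the difference
between the start of the job and when it became eligible. A minimal Python
sketch of that arithmetic follows. It assumes both times are epoch seconds;
the function name, the argument names, and the error raised for out-of-order
timestamps are hypothetical and are not taken from the jobcomp/elasticsearch
plugin.

```python
def queue_wait(start_time: int, eligible_time: int) -> int:
    """Return the seconds between a job becoming eligible and starting.

    Both arguments are assumed to be epoch seconds (an assumption for this
    sketch, not something the announcement specifies).
    """
    if start_time < eligible_time:
        # Hypothetical guard: a job should not start before it is eligible.
        raise ValueError("start_time precedes eligible_time")
    return start_time - eligible_time


# A job eligible at 1469573400 that started at 1469574000 waited 600 seconds.
print(queue_wait(1469574000, 1469573400))  # 600
```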


* Changes in Slurm 17.02.0pre1
==============================

 -- burst_buffer/cray - Add support for rounding up the size of a buffer
    request if the DataWarp configuration "equalize_fragments" is used.
 -- Remove AIX support.
 -- Rename "in" to "input" in slurm_step_io_fds data structure defined in
    slurm.h. This is needed to avoid breaking Python by using one of its
    keywords in a Slurm data structure.
 -- Remove eligible_time from jobcomp/elasticsearch.
 -- Fix issue where if no clusters were added but yet a QOS needed to be
    deleted make it possible.
 -- SlurmDBD - change all timestamps to bigint from int to solve Y2038 problem.
 -- Add salloc/sbatch/srun --spread-job to distribute tasks over as many nodes
    as possible. This also treats the --ntasks-per-node option as a maximum
    value.
 -- Add ConstrainKmemSpace to cgroup.conf, defaulting to yes, to allow
    cgroup Kmem enforcement to be disabled while still using ConstrainRAMSpace.
 -- Add support for sbatch --bbf option.
 -- Add burst buffer support for job arrays. Add new SchedulerParameters option
    of bb_array_stage_cnt=# to indicate how many pending tasks of a job array
    should be made available for burst buffer resource allocation.
 -- Fix small memory leak when a job fails to load from state save.
 -- Fix invalid read when attempting to delete clusters from db with running
    jobs.
 -- Fix small memory leak when deleting clusters from db.
 -- Add SLURM_ARRAY_TASK_COUNT environment variable. Total number of tasks in a
    job array (e.g. "--array=2,4,8" will set SLURM_ARRAY_TASK_COUNT=3).
 -- Add new sacctmgr commands: "shutdown" (shutdown the server), "list stats"
    (get server statistics) "clear stats" (clear server statistics).
 -- Restructure job accounting query to use 'id_job in (1, 2, .. )' format
    instead of logically equivalent 'id_job = 1 || id_job = 2 || ..' .
 -- Added start_delay field to jobcomp/elasticsearch.
 -- In order to support federated jobs, the MaxJobID configuration parameter
    default value has been reduced from 2,147,418,112 to 67,043,328 and its
    maximum value is now 67,108,863. Upon upgrading, any pre-existing jobs that
    have a job ID above the new range will continue to run and new jobs will
    get job IDs in the new range.
 -- Added infrastructure for setting up federations in database and
    establishing connections between federation clusters.
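
The SLURM_ARRAY_TASK_COUNT entry above gives "--array=2,4,8" as an example
that yields a count of 3. The Python sketch below shows one way to predict
that count from an array specification. It is an illustration only, not how
Slurm computes the value. The function name is hypothetical, and the handling
of ranges ("0-15"), steps ("0-15:4"), and the "%N" concurrency suffix is
based on the sbatch --array syntax rather than on this announcement.

```python
def array_task_count(spec: str) -> int:
    """Count the distinct task indices in an --array style specification."""
    # An optional "%N" suffix limits how many tasks run at once; it does not
    # change how many tasks the array contains, so drop it.
    spec = spec.split("%", 1)[0]
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        # Each part is "N", "N-M", or "N-M:STEP".
        rng, _, step = part.partition(":")
        first, _, last = rng.partition("-")
        lo = int(first)
        hi = int(last) if last else lo
        stride = int(step) if step else 1
        indices.update(range(lo, hi + 1, stride))
    return len(indices)


print(array_task_count("2,4,8"))  # 3
```

Inside a running array task the value can be read directly from the
environment instead, for example with os.environ.get("SLURM_ARRAY_TASK_COUNT").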
