Re: [slurm-users] Disable --no-allocate support for a node/SlurmD
Hi,

> Ah okay, so your requirements include completely insulating (some) jobs from outside access, including root?

Correct.

> I've seen this kind of requirement e.g. when working with non-defaced medical data - generally a tough problem imo, because this level of data security seems more or less incompatible with the idea of a multi-user HPC system. I remember that this year's ZKI-AK Supercomputing spring meeting had Sebastian Krey from GWDG presenting the KISSKI ("KI-Servicezentrum für Sensible und Kritische Infrastrukturen", https://kisski.gwdg.de/ ) project, which works in this problem domain - are you involved in that? The setup with containerization and 'node hardening' sounds very similar to me.

Indeed. We (ZIH TU Dresden) are working together with Hendrik Nolte from GWDG to implement their concept of a "secure workflow on HPC" on our system. In short, the idea is to have nodes with additional (cryptographic) authentication of jobs. I'm just double-checking alternatives for some details which may allow an easier implementation of the concept.

> Re "preventing the scripts from running": I'd say it's about as easy as otherwise manipulating any job submission that goes through slurmctld (e.g. by editing slurm.conf), so without knowing your exact use case and requirements, I can't think of a simple solution.

The resource manager, i.e. slurmctld, and slurmd run on different machines. There is a local copy of slurm.conf for the slurmctld and for the node(s), i.e. slurmd, each using only the relevant parts. So slurmd doesn't care about the submit plugins, and slurmctld doesn't (need to) know about the Prolog, correct?

The idea in the workflow is that only the node itself needs to be considered secure, and access to the node is only possible via the slurmd running on it. That slurmd can be configured to always execute the Prolog (a local script) prior to each job and to deny execution if authentication fails. Circumventing this authentication would then require modifying the slurm.conf on that node, which has to be considered impossible anyway, since an attacker with that capability could also modify anything else (e.g. the Prolog itself, to remove the checks).

But the possibility of slurmd handling a --no-allocate job introduces a new way to circumvent the authentication. Using the slurm.conf of the slurmctld effectively only disables requests to the slurmd to skip the Prolog (i.e. the -Z flag), but if the slurmd somehow receives such a request, it will still handle it. So the security now additionally relies on the security of the resource manager. It would be more secure if the slurmd on those nodes could be configured to never skip the Prolog, even if the requesting user appears to be privileged. As the node can be rebooted from a read-only image prior to each job, the security of each job can then be ensured without any influence on the rest of the cluster.

So in summary: we don't want to trust the slurmctld (running somewhere else), but only the slurmd (running on the node), to always execute the Prolog. I hope that explains it well enough.

Kind regards,
Alex
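For readers who want to picture the setup described above, a minimal sketch of such a node-local configuration could look as follows. The slurm.conf options (Prolog, PrologFlags=Alloc) and the documented behaviour that a non-zero Prolog exit code keeps the job from running and drains the node are real Slurm features; the token path, public key location, and signing scheme are purely illustrative assumptions, not the actual GWDG/TU Dresden implementation.

slurm.conf fragment on the compute node:

    Prolog=/etc/slurm/prolog.d/verify_job.sh
    PrologFlags=Alloc        # run the Prolog at job allocation, before any step launches

/etc/slurm/prolog.d/verify_job.sh (hypothetical sketch):

    #!/bin/bash
    # Node-local job authentication: refuse any job that does not carry a valid
    # signature. If this script exits non-zero, slurmd does not run the job
    # (the job is requeued held and the node is drained).
    TOKEN=/secure/job-tokens/${SLURM_JOB_ID}.sig    # assumed drop location for the workflow's signature
    if ! echo -n "$SLURM_JOB_ID" | \
         openssl dgst -sha256 -verify /etc/slurm/workflow_pub.pem -signature "$TOKEN" \
         >/dev/null 2>&1; then
        logger -t job-auth "job $SLURM_JOB_ID failed authentication, refusing to run"
        exit 1
    fi
    exit 0

As Alex notes, a -Z/--no-allocate request that slurmd is still willing to honour would sidestep exactly this check, hence the wish for a node-side way to refuse such requests outright.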
[slurm-users] Slurm version 23.02.3 is now available
We are pleased to announce the availability of Slurm version 23.02.3.

The 23.02.3 release includes a number of fixes to Slurm stability, including potential slurmctld crashes when the backup slurmctld takes over. It also fixes some issues when using older versions of the command line tools with a 23.02 controller.

Slurm can be downloaded from https://www.schedmd.com/downloads.php .

-Tim

--
Tim McMullan
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support

* Changes in Slurm 23.02.3
==========================
 -- Fix regression in 23.02.2 that ignored the partition DefCpuPerGPU setting on the first pass of scheduling a job requesting --gpus --ntasks.
 -- openapi/dbv0.0.39/users - If a default account update failed, resulting in a no-op, the query returned success without any warning. Now a warning is sent back to the client that the default account wasn't modified.
 -- srun - fix issue creating regular and interactive steps because *_PACK_GROUP* environment variables were incorrectly set on non-HetSteps.
 -- Fix dynamic nodes getting stuck in allocated states when reconfiguring.
 -- Avoid job write lock when nodes are dynamically added/removed.
 -- burst_buffer/lua - allow jobs to get scheduled sooner after slurm_bb_data_in completes.
 -- mpi/pmix - fix regression introduced in 23.02.2 which caused PMIx shmem backed files permissions to be incorrect.
 -- api/submit - fix memory leaks when submission of batch regular jobs or batch HetJobs fails (response data is a return code).
 -- openapi/v0.0.39 - fix memory leak in _job_post_het_submit().
 -- Fix regression in 23.02.2 that set the SLURM_NTASKS environment variable in sbatch jobs from --ntasks-per-node when --ntasks was not requested.
 -- Fix regression in 23.02 that caused sbatch jobs to set the wrong number of tasks when requesting --ntasks-per-node without --ntasks, and also requesting one of the following options: --sockets-per-node, --cores-per-socket, --threads-per-core (or --hint=nomultithread), or -B,--extra-node-info.
 -- Fix double counting suspended job counts on nodes when reconfiguring, which prevented nodes with suspended jobs from being powered down or rebooted once the jobs completed.
 -- Fix backfill not scheduling jobs submitted with --prefer and --constraint properly.
 -- Avoid possible slurmctld segfault caused by race condition with already completed slurmdbd_conn connections.
 -- slurmdbd.conf - check that included conf files have 0600 permissions.
 -- slurmrestd - fix regression where "oversubscribe" fields were removed from job descriptions and submissions from v0.0.39 end points.
 -- accounting_storage/mysql - Query for individual QOS correctly when you have more than 10.
 -- Add warning message about ignoring --tres-per-tasks=license when used on a step.
 -- sshare - Fix command to work when using priority/basic.
 -- Avoid loading cli_filter plugins outside of salloc/sbatch/scron/srun. This fixes a number of missing symbol problems that can manifest for executables linked against libslurm (and not libslurmfull).
 -- Allow cloud_reg_addrs to update dynamically registered node's addrs on subsequent registrations.
 -- switch/hpe_slingshot - Fix hetjob components being assigned different vnis.
 -- Revert a change in 22.05.5 that prevented tasks from sharing a core if --cpus-per-task > threads per core, but caused incorrect accounting and cpu binding. Instead, --ntasks-per-core=1 may be requested to prevent tasks from sharing a core.
 -- Correctly send assoc_mgr lock to mcs plugin.
 -- Fix regression in 23.02 leading to error() messages being sent at INFO instead of ERR in syslog.
 -- switch/hpe_slingshot - Fix bad instant-on data due to incorrect parsing of data from jackaloped.
 -- Fix TresUsageIn[Tot|Ave] calculation for gres/gpumem and gres/gpuutil.
 -- Avoid unnecessary gres/gpumem and gres/gpuutil TRES position lookups.
 -- Fix issue in the gpu plugins where gpu frequencies would only be set if both gpu memory and gpu frequencies were set, while one or the other suffices.
 -- Fix reservations group ACLs not working with the root group.
 -- slurmctld - Fix backup slurmctld crash when it takes control multiple times.
 -- Fix updating a job with a ReqNodeList greater than the job's node count.
 -- Fix inadvertent permission denied error for --task-prolog and --task-epilog with filesystems mounted with root_squash.
 -- switch/hpe_slingshot - remove the unused vni_pids option.
 -- Fix missing detailed cpu and gres information in json/yaml output from scontrol, squeue and sinfo.
 -- Fix regression in 23.02 that causes a failure to allocate job steps that request --cpus-per-gpu and gpus with types.
 -- sacct - when printing PLANNED time, use end time instead of start time for jobs cancelled before they started.
 -- Fix potentially waiting indefinitely for a defunct proc
[slurm-users] Fwd: task/cgroup plugin causes "srun: error: task 0 launch failed: Plugin initialization failed" error on Ubuntu 22.04
Hi,

I am maintaining the SLURM cluster of my research group. Recently I updated to Ubuntu 22.04 and Slurm 21.08.5, and ever since, I am unable to launch jobs. When launching a job, I receive the following error:

$ srun --nodes=1 --ntasks-per-node=1 -c 1 --mem-per-cpu 1G --time=01:00:00 --pty -p amd -w cn02 --pty bash -i
srun: error: task 0 launch failed: Plugin initialization failed

Strangely, I cannot find any indication of this problem in the logs (find the logs attached). The problem must be related to the task/cgroup plugin, as it does not occur when I disable it.

After reading the documentation, I tried adding the cgroup_enable=memory swapaccount=1 kernel parameters, but the problem persisted.

I would be very grateful for any advice on where to look, since I have no idea how to investigate this issue further.

Thanks a lot in advance.

Best,
Tim

###
# Slurm cgroup support configuration file
###
CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
ConstrainKmemSpace=no
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
# This will be necessary for controlling GPU access
ConstrainDevices=yes

# slurmd -D -vv --conf-server nas:6817
slurmd: debug: Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug: CPUs:16 Boards:1 Sockets:1 CoresPerSocket:16 ThreadsPerCore:1
slurmd: debug4: CPU map[0]=>0 S:C:T 0:0:0
slurmd: debug4: CPU map[1]=>1 S:C:T 0:1:0
slurmd: debug4: CPU map[2]=>2 S:C:T 0:2:0
slurmd: debug4: CPU map[3]=>3 S:C:T 0:3:0
slurmd: debug4: CPU map[4]=>4 S:C:T 0:4:0
slurmd: debug4: CPU map[5]=>5 S:C:T 0:5:0
slurmd: debug4: CPU map[6]=>6 S:C:T 0:6:0
slurmd: debug4: CPU map[7]=>7 S:C:T 0:7:0
slurmd: debug4: CPU map[8]=>8 S:C:T 0:8:0
slurmd: debug4: CPU map[9]=>9 S:C:T 0:9:0
slurmd: debug4: CPU map[10]=>10 S:C:T 0:10:0
slurmd: debug4: CPU map[11]=>11 S:C:T 0:11:0
slurmd: debug4: CPU map[12]=>12 S:C:T 0:12:0
slurmd: debug4: CPU map[13]=>13 S:C:T 0:13:0
slurmd: debug4: CPU map[14]=>14 S:C:T 0:14:0
slurmd: debug4: CPU map[15]=>15 S:C:T 0:15:0
slurmd: debug3: _set_slurmd_spooldir: initializing slurmd spool directory `/var/spool/slurmd`
slurmd: debug2: hwloc_topology_init
slurmd: debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
slurmd: debug: CPUs:16 Boards:1 Sockets:1 CoresPerSocket:16 ThreadsPerCore:1
slurmd: debug4: CPU map[0]=>0 S:C:T 0:0:0
slurmd: debug4: CPU map[1]=>1 S:C:T 0:1:0
slurmd: debug4: CPU map[2]=>2 S:C:T 0:2:0
slurmd: debug4: CPU map[3]=>3 S:C:T 0:3:0
slurmd: debug4: CPU map[4]=>4 S:C:T 0:4:0
slurmd: debug4: CPU map[5]=>5 S:C:T 0:5:0
slurmd: debug4: CPU map[6]=>6 S:C:T 0:6:0
slurmd: debug4: CPU map[7]=>7 S:C:T 0:7:0
slurmd: debug4: CPU map[8]=>8 S:C:T 0:8:0
slurmd: debug4: CPU map[9]=>9 S:C:T 0:9:0
slurmd: debug4: CPU map[10]=>10 S:C:T 0:10:0
slurmd: debug4: CPU map[11]=>11 S:C:T 0:11:0
slurmd: debug4: CPU map[12]=>12 S:C:T 0:12:0
slurmd: debug4: CPU map[13]=>13 S:C:T 0:13:0
slurmd: debug4: CPU map[14]=>14 S:C:T 0:14:0
slurmd: debug4: CPU map[15]=>15 S:C:T 0:15:0
slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/gres_gpu.so
slurmd: debug: gres/gpu: init: loaded
slurmd: debug3: Success.
slurmd: debug3: _merge_gres2: From gres.conf, using gpu:rtx2080:1:/dev/nvidia0
slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/gpu_generic.so
slurmd: debug: gpu/generic: init: init: GPU Generic plugin loaded
slurmd: debug3: Success.
slurmd: debug3: gres_device_major : /dev/nvidia0 major 195, minor 0
slurmd: Gres Name=gpu Type=rtx2080 Count=1
slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/topology_none.so
slurmd: topology/none: init: topology NONE plugin loaded
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/route_default.so
slurmd: route/default: init: route default plugin loaded
slurmd: debug3: Success.
slurmd: debug2: Gathering cpu frequency information for 16 cpus
slurmd: debug: Resource spec: No specialized cores configured by default on this node
slurmd: debug: Resource spec: Reserved system memory limit not configured for this node
slurmd: debug3: NodeName    = cn02
slurmd: debug3: TopoAddr    = cn02
slurmd: debug3: TopoPattern = node
slurmd: debug3: ClusterName = iascluster
slurmd: debug3: Confile     = `/var/spool/slurmd/conf-cache/slurm.conf'
slurmd: debug3: Debug       = 5
slurmd: debug3: CPUs        = 16 (CF: 16, HW: 16)
slurmd: debug3: Boards      = 1 (CF: 1, HW: 1)
slurmd: debug3: Sockets     = 1 (CF: 1, HW: 1)
slurmd: debug3: Cores       = 16 (CF: 16, HW: 16)
slurmd: debug3: Threads     = 1 (CF: 1, HW: 1)
slurmd: debug3: UpTime      = 2377 = 00:39:37
slurmd: debug3: Block Map   = 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
slurmd: debug3: Inverse Map = 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
slurmd: debug3: RealMemory  = 64216
slurmd: debug3: TmpDisk     = 32108
slurmd: debug3: Epilog      =
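As a side note on the cgroup_enable=memory swapaccount=1 attempt mentioned above: it can be worth confirming on the node what actually ended up on the kernel command line and which cgroup controllers are exposed. These are generic checks (not taken from the original message), assuming a stock Ubuntu 22.04 node:

    # Did the added parameters make it onto the kernel command line?
    cat /proc/cmdline

    # On a cgroup v2 (unified) mount, the available controllers are listed here:
    cat /sys/fs/cgroup/cgroup.controllers

    # On a legacy/hybrid v1 layout you would instead see per-controller directories:
    ls -d /sys/fs/cgroup/memory 2>/dev/null

Neither parameter switches the node between cgroup v1 and v2; they only control whether the memory controller and its swap accounting are enabled, which would be consistent with them not helping here.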
[slurm-users] SLUG Early Bird Ends Tomorrow!
Early Bird registration for Slurm User Group 2023 ends tomorrow, Friday, June 16th!

This year's SLUG event will take place September 12th - 13th at Brigham Young University, with a Welcome Reception at the Provo Marriott Hotel and Conference Center on the evening of Monday, September 11th. Registration includes the Monday evening reception and both days of main conference activity.

SLUG Early Bird registration ends Friday, June 16th. Register now: https://www.eventbrite.com/e/631240546467

The SLUG 2023 Call for Papers also ends tomorrow. All interested parties should send an abstract to sl...@schedmd.com by EOD, Friday, June 16th.

SchedMD has secured a room block at the Provo Marriott and Conference Center at a discounted rate of 139 USD/night. This rate is good for check-in on September 11th and checkout on September 13th. The discount is available until Monday, August 14th on a first-come, first-served basis. Book your Provo Marriott stay now: https://www.marriott.com/events/start.mi?id=1685745988857&key=GRP

For more information on other hotels and travel, check out the SLUG registration page.

--
Victoria Hobson
Vice President of Marketing
SchedMD LLC
Re: [slurm-users] task/cgroup plugin causes "srun: error: task 0 launch failed: Plugin initialization failed" error on Ubuntu 22.04
I don’t have any direct advice off-hand, but I figure I will try to help steer the conversation in the right direction for figuring it out.

I’m going to assume that since you mention 21.08.5, you are using the slurm-wlm packages from the Ubuntu repos and not building Slurm yourself? And have all the components (slurmctld(s), slurmdbd, slurmd(s)) been upgraded as well?

The only thing that immediately comes to mind is that I remember reading a good bit about Ubuntu 22.04’s use of cgroups v2, which as I understand it is very different from cgroups v1, and plenty of people have had issues with v1/v2 mismatches with Slurm and other applications.

https://www.reddit.com/r/SLURM/comments/vjquih/error_cannot_find_cgroup_plugin_for_cgroupv2/
https://groups.google.com/g/slurm-users/c/0dJhe5r6_2Q?pli=1
https://discuss.linuxcontainers.org/t/after-updated-to-more-recent-ubuntu-version-with-cgroups-v2-ubuntu-16-04-container-is-not-working-properly/14022

Hope that at least steers the conversation in a good direction.

Reed

> On Jun 15, 2023, at 5:04 PM, Tim Schneider wrote:
>
> Hi,
> I am maintaining the SLURM cluster of my research group. Recently I updated to Ubuntu 22.04 and Slurm 21.08.5, and ever since, I am unable to launch jobs. When launching a job, I receive the following error:
>
> $ srun --nodes=1 --ntasks-per-node=1 -c 1 --mem-per-cpu 1G --time=01:00:00 --pty -p amd -w cn02 --pty bash -i
> srun: error: task 0 launch failed: Plugin initialization failed
>
> Strangely, I cannot find any indication of this problem in the logs (find the logs attached). The problem must be related to the task/cgroup plugin, as it does not occur when I disable it.
>
> After reading the documentation, I tried adding the cgroup_enable=memory swapaccount=1 kernel parameters, but the problem persisted.
>
> I would be very grateful for any advice on where to look, since I have no idea how to investigate this issue further.
>
> Thanks a lot in advance.
>
> Best,
> Tim
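A quick way to confirm which cgroup hierarchy a node actually boots with (a generic check, not from the original posts):

    stat -fc %T /sys/fs/cgroup
    # cgroup2fs -> unified cgroup v2 hierarchy (the Ubuntu 22.04 default)
    # tmpfs     -> legacy/hybrid cgroup v1 layout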
Re: [slurm-users] task/cgroup plugin causes "srun: error: task 0 launch failed: Plugin initialization failed" error on Ubuntu 22.04
Indeed, the issue seems to be that Ubuntu 22.04 no longer uses cgroups v1 by default. Does SLURM support cgroups v2? It seems so: https://slurm.schedmd.com/cgroup_v2.html

Abel

> On Jun 15, 2023, at 20:20, Reed Dier wrote:
>
> I don’t have any direct advice off-hand, but I figure I will try to help steer the conversation in the right direction for figuring it out.
>
> I’m going to assume that since you mention 21.08.5, you are using the slurm-wlm packages from the Ubuntu repos and not building Slurm yourself?
>
> And have all the components (slurmctld(s), slurmdbd, slurmd(s)) been upgraded as well?
>
> The only thing that immediately comes to mind is that I remember reading a good bit about Ubuntu 22.04’s use of cgroups v2, which as I understand it is very different from cgroups v1, and plenty of people have had issues with v1/v2 mismatches with Slurm and other applications.
>
> https://www.reddit.com/r/SLURM/comments/vjquih/error_cannot_find_cgroup_plugin_for_cgroupv2/
> https://groups.google.com/g/slurm-users/c/0dJhe5r6_2Q?pli=1
> https://discuss.linuxcontainers.org/t/after-updated-to-more-recent-ubuntu-version-with-cgroups-v2-ubuntu-16-04-container-is-not-working-properly/14022
>
> Hope that at least steers the conversation in a good direction.
>
> Reed
>
>> On Jun 15, 2023, at 5:04 PM, Tim Schneider wrote:
>>
>> Hi,
>> I am maintaining the SLURM cluster of my research group. Recently I updated to Ubuntu 22.04 and Slurm 21.08.5, and ever since, I am unable to launch jobs. When launching a job, I receive the following error:
>>
>> $ srun --nodes=1 --ntasks-per-node=1 -c 1 --mem-per-cpu 1G --time=01:00:00 --pty -p amd -w cn02 --pty bash -i
>> srun: error: task 0 launch failed: Plugin initialization failed
>>
>> Strangely, I cannot find any indication of this problem in the logs (find the logs attached). The problem must be related to the task/cgroup plugin, as it does not occur when I disable it.
>>
>> After reading the documentation, I tried adding the cgroup_enable=memory swapaccount=1 kernel parameters, but the problem persisted.
>>
>> I would be very grateful for any advice on where to look, since I have no idea how to investigate this issue further.
>>
>> Thanks a lot in advance.
>>
>> Best,
>> Tim
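For completeness, the two usual ways out of a v1/v2 mismatch like this are either to boot the node back into the legacy hierarchy or to move to a Slurm release that ships the cgroup/v2 plugin. A rough sketch only; the exact version requirements should be checked against the Slurm release notes and the cgroup_v2 page linked above:

    # Option 1: keep Slurm 21.08 and boot Ubuntu 22.04 back into the legacy cgroup v1
    # hierarchy, e.g. via /etc/default/grub followed by update-grub and a reboot:
    #   GRUB_CMDLINE_LINUX_DEFAULT="... systemd.unified_cgroup_hierarchy=0"

    # Option 2: upgrade to a Slurm release with cgroup v2 support (22.05 or newer)
    # and select the plugin explicitly in cgroup.conf:
    #   CgroupPlugin=cgroup/v2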