[slurm-users] question about hyperthreaded CPUs, --hint=nomultithread and multi-jobstep jobs

2023-05-23 Thread Hans van Schoot

Hi all,

I am getting some unexpected behavior with SLURM on a multithreaded CPU 
(AMD Ryzen 7950X), in combination with a job that uses multiple jobsteps 
and a program that prefers to run without hyperthreading.


My job consists of a simple shell script that does multiple srun 
executions, and normally (on non-multithreaded nodes) the srun commands 
will only start when resources are available inside my allocation. Example:


sbatch -N 1 -n 16 mytestjob.sh

mytestjob.sh contains:
srun -n 8 someMPIprog &
srun -n 8 someMPIprog &
srun -n 8 someMPIprog &
srun -n 8 someMPIprog &
wait

srun 1 and 2 will start immediately, srun 3 will start as soon as one of 
the first two jobsteps is finished, and srun 4 will again wait until 
some cores are available.


Now I would like this same behavior (no multithreading, one task per 
core) on a node with 16 multithreaded cores (32 cpus in SLURM 
ThreadsPerCore=2), so I submit with the following command:

sbatch -N 1 --hint=nomultithread -n 16 mytestjob.sh

Slurm correctly reserves the whole node for this, and srun without 
additional directions would launch someMPIprog with 16 MPI ranks.
Unfortunately, in the multi-job-step situation this causes all four srun 
invocations to start immediately, resulting in 4 x 8 MPI ranks running at 
the same time, and thus multithreading. As I specified 
--hint=nomultithread, I would have expected the same behaviour as on the 
non-multithreaded node: srun 1 and 2 launch directly, and srun 3 and 4 
wait for CPU resources to become available.



So far I've found two hacky ways of working around this problem:
- do not use --hint=nomultithread, and instead limit by memory 
(--mem-per-cpu=4000). This is a bit ugly: it reserves half the compute 
node and seems to bind to the wrong CPU cores.
- set --cpus-per-task=2 instead of --hint=nomultithread, but this causes 
OpenMP to kick in if the MPI program supports it.
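
One more variant might be worth testing (an untested sketch, not a confirmed 
fix): repeating the no-multithreading request at the step level with 
--ntasks-per-core=1 and/or --hint=nomultithread on each srun, possibly 
combined with the step-level --exclusive flag, which the srun man page 
documents as dedicating CPUs to a job step and deferring the step when no 
CPUs are free. Whether these step-level options make 18.08 charge two 
hardware threads per task inside the allocation is an assumption here.

# mytestjob.sh, variant to test (the flags exist in srun, but their combined
# effect on step scheduling under 18.08 is not verified):
srun --exclusive --ntasks-per-core=1 -n 8 someMPIprog &
srun --exclusive --ntasks-per-core=1 -n 8 someMPIprog &
srun --exclusive --ntasks-per-core=1 -n 8 someMPIprog &
srun --exclusive --ntasks-per-core=1 -n 8 someMPIprog &
wait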



To me this feels like a bit of a bug in SLURM: I tell it not to 
multithread, but it still schedules job steps in a way that causes the CPU 
to multithread.


Is there another way of getting the non-multithreaded behavior without 
disabling multithreading in BIOS?


Best regards and many thanks in advance!
Hans van Schoot


Some additional information:
- I'm running slurm 18.08.4
- This is my node configuration in scontrol:
scontrol show nodes compute-7-0
NodeName=compute-7-0 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=0 CPUTot=32 CPULoad=1.00
   AvailableFeatures=rack-7,32CPUs
   ActiveFeatures=rack-7,32CPUs
   Gres=(null)
   NodeAddr=10.1.1.210 NodeHostName=compute-7-0 Version=18.08
   OS=Linux 6.1.8-1.el7.elrepo.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Jan 23 12:57:27 EST 2023
   RealMemory=64051 AllocMem=0 FreeMem=62681 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=20511200 Owner=N/A MCS_label=N/A
   Partitions=zen4
   BootTime=2023-04-04T14:14:55 SlurmdStartTime=2023-04-17T12:32:38
   CfgTRES=cpu=32,mem=64051M,billing=47
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s



Re: [slurm-users] cgroups issue

2023-05-23 Thread Alan Orth
Dear Boris,

Do you really mean Ubuntu 14.04? I doubt that will work with modern SLURM
cgroups, even v1...
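
(A quick way to confirm which cgroup setup a node actually has, assuming 
/sys/fs/cgroup is mounted as usual, is to check the filesystem type at the 
cgroup root:)

# "cgroup2fs" means unified cgroup v2; "tmpfs" indicates the legacy v1/hybrid layout
stat -fc %T /sys/fs/cgroup/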

Regards,

On Tue, Mar 14, 2023 at 5:28 PM Boris Yazlovitsky 
wrote:

> I sent this a while ago - don't know if it got to the mailing list:
>
> I'm running slurm 23.02.0 on ubuntu 14.04
> when a batch job is submitted, getting this message in the error file:
>
> slurmstepd: error: common_file_write_content: unable to write 1 bytes to
> cgroup /sys/fs/cgroup/memory/slurm/uid_1000/memory.use_hierarchy: Device or
> resource busy
> slurmstepd: error: unable to set hierarchical accounting for
> /slurm/uid_1000
> slurmstepd: error: Could not find task_cpuacct_cg, this should never happen
> slurmstepd: error: Cannot get cgroup accounting data for 0
>
> This happens for both batch and interactive jobs.
>
> any pointers will be most appreciated.
>
> thanks!
> Boris
>


-- 
Alan Orth
alan.o...@gmail.com
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch


Re: [slurm-users] Slurmd enabled crash with CgroupV2

2023-05-23 Thread Alan Orth
I notice the exact same behavior as Tristan. My CentOS Stream 8 system is
in full unified cgroup v2 mode, the slurmd.service has a "Delegate=Yes"
override added to it, and all the cgroup settings are in slurm.conf and
cgroup.conf, yet slurmd does not start after a reboot. I don't understand
what is happening, but I see the exact same change in cgroup.subtree_control
when disabling and re-enabling slurmd.

[root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
memory pids

[root@compute ~]# systemctl disable slurmd
Removed /etc/systemd/system/multi-user.target.wants/slurmd.service.
[root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory pids
[root@compute ~]# systemctl enable slurmd
Created symlink /etc/systemd/system/multi-user.target.wants/slurmd.service
→ /usr/lib/systemd/system/slurmd.service.
[root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory pids

After this slurmd starts up successfully (until the next reboot). We are on
version 22.05.9.
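
One thing that might be worth trying (a sketch under the assumption that the 
root cause is systemd not enabling the cpu/cpuset controllers for slurmd's 
cgroup at boot): a drop-in that names the controllers explicitly in 
Delegate=, which systemd.resource-control(5) documents as an alternative to 
a bare Delegate=Yes. Whether this avoids the reboot race is untested.

# /etc/systemd/system/slurmd.service.d/override.conf  (untested sketch)
[Service]
# Explicitly request delegation of these controllers instead of Delegate=Yes
Delegate=cpu cpuset io memory pids

# then:
#   systemctl daemon-reload && systemctl restart slurmd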

Regards,



On Fri, Mar 10, 2023 at 10:10 PM Brian Andrus  wrote:

> I'm not sure which specific item to look at, but this seems like a race
> condition.
> Likely you need to add an override to your slurmd startup
> (/etc/systemd/system/slurmd.service/override.conf) and put a dependency
> there so it won't start until that is done.
>
> I have mine wait for a few things:
>
> [Unit]
> After=autofs.service getty.target sssd.service
>
>
> That makes it wait for all of those before trying to start.
>
> Brian Andrus
> On 3/10/2023 7:41 AM, Tristan LEFEBVRE wrote:
>
> Hello to all,
>
> I'm trying to install Slurm with cgroup v2 enabled.
>
> But I'm running into an odd problem: when slurmd is enabled, it crashes at
> the next reboot and will not start again unless I disable it.
>
> Here is a full example of the situation
>
>
> [root@compute ~]# systemctl start slurmd
> [root@compute ~]# systemctl status slurmd
> ● slurmd.service - Slurm node daemon
>Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled; vendor 
> preset: disabled)
>Active: active (running) since Fri 2023-03-10 15:57:00 CET; 967ms ago
>  Main PID: 8053 (slurmd)
> Tasks: 1
>Memory: 3.1M
>CGroup: /system.slice/slurmd.service
>└─8053 /opt/slurm_bin/sbin/slurmd -D --conf-server X:6817 -s
>
> mars 10 15:57:00 compute.cluster.lab systemd[1]: Started Slurm node daemon.
> mars 10 15:57:00 compute.cluster.lab slurmd[8053]: slurmd: slurmd version 
> 23.02.0 started
> mars 10 15:57:00 compute.cluster.lab slurmd[8053]: slurmd: slurmd started on 
> Fri, 10 Mar 2023 15:57:00 +0100
> mars 10 15:57:00 compute.cluster.lab slurmd[8053]: slurmd: CPUs=48 Boards=1 
> Sockets=2 Cores=24 Threads=1 Memory=385311 TmpDisk=19990 Uptime=12>
>
> [root@compute ~]# systemctl enable slurmd
> Created symlink /etc/systemd/system/multi-user.target.wants/slurmd.service → 
> /usr/lib/systemd/system/slurmd.service.[root@compute ~]#  reboot now
>
> > [ reboot of the node]
>
> [adm@compute ~]$ sudo systemctl status slurmd
> ● slurmd.service - Slurm node daemon
>Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor 
> preset: disabled)
>Active: failed (Result: exit-code) since Fri 2023-03-10 16:00:33 CET; 1min 
> 0s ago
>   Process: 2659 ExecStart=/opt/slurm_bin/sbin/slurmd -D --conf-server 
> :6817 -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
>  Main PID: 2659 (code=exited, status=1/FAILURE)
> mars 10 16:00:33 compute.cluster.lab slurmd[2659]: slurmd: slurmd version 
> 23.02.0 started
> mars 10 16:00:33 compute.cluster.lab slurmd[2659]: slurmd: error: Controller 
> cpuset is not enabled!
> mars 10 16:00:33 compute.cluster.lab slurmd[2659]: slurmd: error: Controller 
> cpu is not enabled!
> mars 10 16:00:33 compute.cluster.lab slurmd[2659]: slurmd: error: cpu cgroup 
> controller is not available.
> mars 10 16:00:33 compute.cluster.lab slurmd[2659]: slurmd: error: There's an 
> issue initializing memory or cpu controller
> mars 10 16:00:33 compute.cluster.lab slurmd[2659]: slurmd: error: Couldn't 
> load specified plugin name for jobacct_gather/cgroup: Plugin init()>
> mars 10 16:00:33 compute.cluster.lab slurmd[2659]: slurmd: error: cannot 
> create jobacct_gather context for jobacct_gather/cgroup
> mars 10 16:00:33 compute.cluster.lab slurmd[2659]: slurmd: fatal: Unable to 
> initialize jobacct_gather
> mars 10 16:00:33 compute.cluster.lab systemd[1]: slurmd.service: Main process 
> exited, code=exited, status=1/FAILURE
> mars 10 16:00:33 compute.cluster.lab systemd[1]: slurmd.service: Failed with 
> result 'exit-code'.
>
> [adm@compute ~]$ sudo systemctl start slurmd
> [adm@compute ~]$ sudo systemctl status slurmd
> ● slurmd.service - Slurm node daemon
>Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor 
> preset: disabled)
>Active: failed (Result: exit-code) since Fri 2023-03-10 16:01:37 CET; 1s 
> ago
>   Process: 3321 ExecStart=/opt/slurm_bin/sbin/slurmd -D --conf

[slurm-users] Weird scheduling behaviour

2023-05-23 Thread Badorreck, Holger
Hello,
I observe weird behaviour in my SLURM installation (23.02.2). Some jobs take 
hours to be scheduled (probably on one specific node); the pending-state 
reason is "Resources", although resources are free.
I have experimented a bit and can reproduce this behaviour with salloc:
"salloc --ntasks=4 --mem-per-cpu=3500M --gres=gpu:1" waits for resources, while
"salloc --ntasks=4 --mem-per-cpu=3700M --gres=gpu:1" is scheduled immediately 
(while the command above is still waiting).

I have already restarted slurmd on that node as well as slurmctld, with no 
change in that behaviour.

This is the node configuration:

NodeName=node6 NodeHostname=cluster-node6 Port=17002 CPUs=64 RealMemory=254000 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:a10:3 Weight=2 State=UNKNOWN

Gres.conf:
AutoDetect=off
Name=gpu Type=a10   File=/dev/nvidia0
Name=gpu Type=a10   File=/dev/nvidia1
Name=gpu Type=a10   File=/dev/nvidia2

What could be the issue here?
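
A few generic diagnostics that might narrow this down (suggestions only, not 
a diagnosis): compare what the controller thinks is allocated on node6 with 
what the pending job requests, and check whether MaxMemPerCPU is set; the 
sbatch/salloc documentation notes that a --mem-per-cpu request above 
MaxMemPerCPU makes Slurm adjust the CPU count per task, which could explain 
why two nearby memory values behave so differently.

# current allocation on the node as slurmctld sees it
scontrol show node node6 | grep -E 'CPUAlloc|CfgTRES|AllocTRES'
# jobs currently on the node, with CPUs, memory and GRES
squeue -w node6 -o "%.10i %.9u %.4C %.8m %.15b %T"
# details of the pending request (replace <jobid> with the waiting job's id)
scontrol show job <jobid> | grep -E 'Reason|NumCPUs|MinMemory|TRES'
# is a MaxMemPerCPU limit configured?
scontrol show config | grep MaxMemPerCPU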

Regards,
Holger


Re: [slurm-users] [EXTERNAL] Re: Question about PMIX ERROR messages being emitted by some child of srun process

2023-05-23 Thread Pritchard Jr., Howard
Thanks Christopher,

This doesn't seem to be related to Open MPI at all, except that for our 5.0.0 
and newer releases one has to use PMIx to talk to the job launcher.
I built MPICH 4.1 on Perlmutter using the --with-pmix option and see a similar 
message from srun --mpi=pmix:

hpp@nid008589:~/ompi/examples> (v5.0.x *)srun -u -n 2 --mpi=pmix ./hello_c
srun: Job 9369984 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for StepId=9369984.2
[nid008589:104119] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
[nid008593:11389] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 750
Hello, world, I am 0 of 2, (MPICH Version:  4.1

I too noticed that if I set PMIX_DEBUG=1 the chatter from srun stops.  
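
(For reference, a minimal form of that workaround, i.e. the same invocation 
as above with the variable set just for that command:)

PMIX_DEBUG=1 srun -u -n 2 --mpi=pmix ./hello_c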

Howard


On 5/22/23, 3:49 PM, "slurm-users on behalf of Christopher Samuel" <ch...@csamuel.org> wrote:


Hi Tommi, Howard,


On 5/22/23 12:16 am, Tommi Tervo wrote:


> 23.02.2 contains a PMIx permission regression; it may be worth checking if 
> that's the case here?


I confirmed I could replicate the UNPACK-INADEQUATE-SPACE messages 
Howard is seeing on a test system, so I tried that patch on that same 
system without any change. :-(


Looking at the PMIx code base the messages appear to come from that code 
(the triggers are in src/mca/bfrops/) and I saw I could set 
PMIX_DEBUG=verbose to get more info on the problem, but when I set that 
these messages go away entirely. :-/


Very odd.


-- 
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA


Re: [slurm-users] [EXTERNAL] Re: Question about PMIX ERROR messages being emitted by some child of srun process

2023-05-23 Thread Christopher Samuel

On 5/23/23 10:33 am, Pritchard Jr., Howard wrote:


> Thanks Christopher,


No worries!


> This doesn't seem to be related to Open MPI at all, except that for our 5.0.0
> and newer releases one has to use PMIx to talk to the job launcher.
> I built MPICH 4.1 on Perlmutter using the --with-pmix option and see a similar
> message from srun --mpi=pmix:


That's right, these messages are coming from PMIx code rather than MPI.


> I too noticed that if I set PMIX_DEBUG=1 the chatter from srun stops.


Yeah, it looks like setting PMIX_DEBUG to anything (I tried "hello") 
stops these messages from being emitted.


Slurm RPMs with that patch will go on to Perlmutter in the Thursday 
maintenance.


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




[slurm-users] Task launch failure on cloud nodes (Address family '0' not supported)

2023-05-23 Thread Weaver, Christopher
I'm working on setting up a cloud partition, and running into some 
communication problems between my nodes. This looks like something I have 
misconfigured, or information I haven't correctly supplied to Slurm, but the 
low-level nature of the error has made it hard for me to figure out what I've 
done wrong.

I have a batch script which is essentially:

#!/bin/sh
#SBATCH --time=2
#SBATCH --partition=cloud
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=1
srun -v --slurmd-debug=verbose singularity exec my-image.sif some args

When I submit this with `sbatch`, two 4-core VM nodes are started up as 
expected, the batch script is sent to one of them, and it begins executing the 
`srun`. That seems to allocate the necessary job step, but then fails when 
trying to communicate with the nodes in the allocation to start the tasks:

srun: jobid 320: nodes(2):`ec[0-1]', cpu counts: 4(x2)
srun: debug2: creating job with 8 tasks
srun: debug:  requesting job 320, user 1000, nodes 2 including ((null))
srun: debug:  cpus 8, tasks 8, name singularity, relative 65534
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: debug:  Entering slurm_step_launch
srun: debug:  mpi type = (null)
srun: debug:  mpi/none: p_mpi_hook_client_prelaunch: Using mpi/none
srun: debug:  Entering _msg_thr_create()
srun: debug4: eio: handling events for 2 objects
srun: debug3: eio_message_socket_readable: shutdown 0 fd 9
srun: debug3: eio_message_socket_readable: shutdown 0 fd 5
srun: debug:  initialized stdio listening socket, port 43793
srun: debug:  Started IO server thread (139796182816512)
srun: debug:  Entering _launch_tasks
srun: debug3: IO thread pid = 1507
srun: debug4: eio: handling events for 4 objects
srun: debug2: Called _file_readable
srun: debug3:   false, all ioservers not yet initialized
srun: launching StepId=320.0 on host ec0, 4 tasks: [0-3]
srun: debug3: uid:1000 gid:1000 cwd:/tmp/job-320 0
srun: launching StepId=320.0 on host ec1, 4 tasks: [4-7]
srun: debug3: uid:1000 gid:1000 cwd:/tmp/job-320 1
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug3: Called _listening_socket_readable
srun: debug3: Trying to load plugin /usr/lib64/slurm/route_default.so
srun: route/default: init: route default plugin loaded
srun: debug3: Success.
srun: debug3: Tree sending to ec0
srun: debug2: Tree head got back 0 looking for 2
srun: debug3: Tree sending to ec1
srun: error: slurm_get_port: Address family '0' not supported
srun: error: Error connecting, bad data: family = 0, port = 0
srun: debug3: problems with ec1
srun: error: slurm_get_port: Address family '0' not supported
srun: error: Error connecting, bad data: family = 0, port = 0
srun: debug3: problems with ec0
srun: debug2: Tree head got back 2
srun: debug:  launch returned msg_rc=1001 err=1001 type=9001
srun: error: Task launch for StepId=320.0 failed on node ec1: Communication connection failure
srun: debug:  launch returned msg_rc=1001 err=1001 type=9001
srun: error: Task launch for StepId=320.0 failed on node ec0: Communication connection failure
srun: error: Application launch failed: Communication connection failure
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

It looks like the problem is an inability to obtain correct addresses for the 
nodes in order to send data to them. Rather than a failure to resolve the 
hostnames to addresses with DNS (which should work on these nodes), it appears 
that the Slurm code in `srun` thinks it already has addresses and attempts to 
use them, even though they are in some uninitialized or partially initialized 
state (`ss_family` == 0).
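
For cloud nodes Slurm documents two ways of making the nodes' real addresses 
known to the rest of the cluster, and either might be the missing piece here 
(hedged suggestions, not a confirmed diagnosis): have the ResumeProgram 
register each node's address once the VM is up, or let the daemons resolve 
cloud node names via DNS.

# Option 1: inside the ResumeProgram, once the VM exists
# (<ip> and <hostname> come from your provisioning step; ec0 is just the example node)
scontrol update NodeName=ec0 NodeAddr=<ip> NodeHostname=<hostname>

# Option 2: in slurm.conf, rely on DNS for cloud node addresses
SlurmctldParameters=cloud_dns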