[slurm-users] Prevent CLOUD node from being shut down after startup
Dear slurm-users,

I am currently looking into options for disabling automatic suspending of nodes. I am interested in both a general case and a special case.

The general case: all nodes can be powered up, but no node is suspended automatically; nodes are only powered down when I trigger power down manually.

The special case: all nodes can be powered up, but some nodes are not suspended automatically; those nodes are only powered down when I trigger power down manually.

---

I tried using negative times for SuspendTime, but that didn't seem to work, as no nodes are powered up then.

Best regards,
Xaver Stiensmeier
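P.S. For context, the kind of power-saving setup I am experimenting with looks roughly like this; the programs and node names below are only placeholders, not our actual configuration:

# slurm.conf (sketch, placeholder values)
SuspendProgram=/opt/slurm/suspend.sh   # placeholder script that shuts a node down
ResumeProgram=/opt/slurm/resume.sh     # placeholder script that starts a node
SuspendTime=300                        # idle seconds before automatic suspend
# SuspendTime=-1                       # what I tried: INFINITE/-1 disables power saving
#                                      # entirely, and then no nodes are powered up at all
SuspendTimeout=120
ResumeTimeout=600
NodeName=cloud[001-010] State=CLOUD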
Re: [slurm-users] Prevent CLOUD node from being shut down after startup
Xaver,

Your description of the cases is a bit difficult to understand. It seems you want to have exceptions for power_up. That could be done by filtering the list of nodes yourself with any script/method you like and then doing power_up on the remaining list.

For excluding nodes from being suspended, there is an option in slurm.conf:

SuspendExcNodes
    Specifies the nodes which are to not be placed in power save mode, even if the node remains idle for an extended period of time. Use Slurm's hostlist expression to identify nodes with an optional ":" separator and count of nodes to exclude from the preceding range. For example "nid[10-20]:4" will prevent 4 usable nodes (i.e. IDLE and not DOWN, DRAINING or already powered down) in the set "nid[10-20]" from being powered down. Multiple sets of nodes can be specified with or without counts in a comma separated list (e.g. "nid[10-20]:4,nid[80-90]:2"). By default no nodes are excluded. This value may be updated with scontrol. See ReconfigFlags=KeepPowerSaveSettings for setting persistence.

Brian Andrus
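P.S. A sketch of how that could look in slurm.conf; the node names below are placeholders, not taken from your cluster:

SuspendTime=300                        # automatic suspend stays on for everything else
SuspendExcNodes=worker[1-4]            # these nodes are never suspended automatically
# or keep a minimum number of usable nodes in each range powered up:
# SuspendExcNodes=worker[1-8]:2,gpu[1-2]:1

Manually powering down an excluded node would still work with something like "scontrol update NodeName=worker1 State=POWER_DOWN".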
[slurm-users] Invalid device ordinal
Hello - I'm trying to get GPU container jobs working on virtual nodes. The jobs fail with "Test CUDA failure common.cu:893 'invalid device ordinal'" in the output file and "slurmstepd: error: mpi/pmix_v3: _errhandler: n4 [0]: pmixp_client_v2.c:211: Error handler invoked: status = -25, source = [slurm.pmix.126.0:1]" in the error file. Google points me to issues where others are selecting the wrong GPU or too many GPUs, but I'm just trying to get one GPU (per node) working.

Some info:

* slurm-22.05
* slurm-22.05.5-1.el9.x86_64
* slurm-contribs-22.05.5-1.el9.x86_64
* slurm-devel-22.05.5-1.el9.x86_64
* slurm-libpmi-22.05.5-1.el9.x86_64
* slurm-pam_slurm-22.05.5-1.el9.x86_64
* slurm-perlapi-22.05.5-1.el9.x86_64
* slurm-slurmctld-22.05.5-1.el9.x86_64
* slurm-example-configs-22.05.5-1.el9.x86_64
* nvslurm-plugin-pyxis-0.14.0-1.el9.x86_64
* Rocky Linux release 9.0 (Blue Onyx)
* KVM virtualization
* 6-node cluster n0 - n5. n4 and n5 have one Tesla V100-SXM2-16GB each.
* Driver Version: 530.30.02

My attempt at setting this up:

* Configure GresTypes=gpu in slurm.conf
* Separate n4 and n5 in slurm.conf to use the GresType
      NodeName=n[4-5] GRES=gpu:1 CPUs=3 State=UNKNOWN
* Create /etc/slurm/gres.conf on each GPU node
      Name=gpu File=/dev/nvidia0
* Sync slurm.conf across the cluster and restart slurmd on n[1-5]
* Restart slurmctld on n0
* Resume n4 and n5
      scontrol update nodename=n[4-5] state=resume

References: https://slurm.schedmd.com/gres.html, https://slurm.schedmd.com/gres.conf.html

This little test script works and gives me GPU info:

#!/bin/sh
#SBATCH -J gpu_test
#SBATCH -N 1
#SBATCH -n 3
#SBATCH -w n5
#SBATCH -o %j.o
#SBATCH -e %j.e
nvidia-smi
nvidia-debugdump -l

This script fails with the errors I mentioned above:

#!/bin/sh
#SBATCH -J tfmpi
#SBATCH -N 2
#SBATCH -n 6
#SBATCH -w n[4-5]
#SBATCH -o %j.o
#SBATCH -e %j.e
#SBATCH --gres=gpu:1
#SBATCH --gpus=1
srun --mpi=pmix --container-image=nvcr.io#nvidia/tensorflow:23.02-tf2-py3 all_reduce_perf_mpi -b 1G -e 1G -c 1

What am I missing to get the second script to run? Thank you.

Cornelius Henderson
Senior Systems Administrator
NASA Center for Climate Simulation (NCCS)
ASRC Federal InuTeq, LLC
Goddard Space Flight Center
Greenbelt, MD 20771
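P.S. For reference, a minimal per-node check I plan to try along the same lines (same container image and nodes as above, but without MPI):

# one task per GPU node; each task should list exactly one V100 if both the
# GRES allocation and the container runtime can see the device
srun -N 2 --ntasks-per-node=1 -w n[4-5] --gres=gpu:1 \
     --container-image=nvcr.io#nvidia/tensorflow:23.02-tf2-py3 \
     nvidia-smi -L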