[slurm-users] Slurm PID Files

2024-11-20 Thread Matthias Leopold via slurm-users
Hi, I compiled and installed Slurm 24.05 on Ubuntu 22.04 following this tutorial: https://www.schedmd.com/slurm/installation-tutorial/ Systemd service files are from deb packages that result from this. Do I have to worry that slurmctld and slurmd don't write PID files although SlurmctldPidFil

[slurm-users] Re: [EXTERN] Re: Slurm and NVIDIA NVML

2024-11-13 Thread Matthias Leopold via slurm-users
On Wed, Nov 13, 2024 at 10:21 AM Matthias Leopold via slurm-users <slurm-users@lists.schedmd.com> wrote: Hi, I'm trying to compile Slurm with NVIDIA NVML support, but the result is unexpected. I get /usr/lib/x86_64-linux-gnu/slurm/gpu_nvml.so,

[slurm-users] Slurm and NVIDIA NVML

2024-11-13 Thread Matthias Leopold via slurm-users
Hi, I'm trying to compile Slurm with NVIDIA NVML support, but the result is unexpected. I get /usr/lib/x86_64-linux-gnu/slurm/gpu_nvml.so, but when I do "ldd /usr/lib/x86_64-linux-gnu/slurm/gpu_nvml.so" there is no reference to /lib/x86_64-linux-gnu/libnvidia-ml.so.1 (which I would expect).
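A minimal sketch of the check described above, assuming the `ldd` output has already been captured as text. The helper name `references_library` and the sample output are hypothetical and not taken from the thread. Note that `ldd` lists only link-time dependencies, so a library that a program loads at run time would not appear there.

```python
def references_library(ldd_output: str, library_prefix: str) -> bool:
    """Return True if any dependency line in ldd output names the library."""
    for line in ldd_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The first field of each ldd line is the library name or its path.
        soname = fields[0].rsplit("/", 1)[-1]
        if soname.startswith(library_prefix):
            return True
    return False


# Hypothetical ldd output, for illustration only.
SAMPLE_LDD_OUTPUT = """\
\tlinux-vdso.so.1 (0x00007ffd0a1f2000)
\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1c2a000000)
\t/lib64/ld-linux-x86-64.so.2 (0x00007f1c2a400000)
"""

print(references_library(SAMPLE_LDD_OUTPUT, "libnvidia-ml.so"))
print(references_library(SAMPLE_LDD_OUTPUT, "libc.so"))
```

On a real system the text could come from `subprocess.run(["ldd", path], capture_output=True, text=True).stdout`.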

[slurm-users] slurmdbd 17.02: "cluster not registered" (but things work)

2024-02-19 Thread Matthias Leopold via slurm-users
Hi, I need to take care of a 17.02 Slurm cluster (I'm preparing it for upgrades). I see that slurmdbd logs various "cluster not registered" messages at startup (DBD_CLUSTER_TRES,DBD_JOB_START,DBD_STEP_START), but I don't see a real problem. Accounting works. Do I have to worry? Can this be re

[slurm-users] Reconfigure Gres for Node online?

2023-12-07 Thread Matthias Leopold
Hi, I want to change the Gres definition for a node from NodeName=s0-n10 Gres=gpu:a100:5 to NodeName=s0-n10 Gres=gpu:a100-sxm4-80gb:5. The hardware stays the same and only the Gres name changes; a100-sxm4-80gb is already defined in the cluster. When I do this online, will it affect running jobs on the node? Slur
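To make the proposed change concrete, here is a small sketch that splits the two Gres values from the message into their parts and confirms that only the type differs. The `GresSpec` tuple and both helper functions are hypothetical illustrations, and the parser assumes only the `name:count` and `name:type:count` forms. It says nothing about how Slurm itself treats the change for running jobs.

```python
from typing import NamedTuple, Optional


class GresSpec(NamedTuple):
    name: str            # e.g. "gpu"
    type: Optional[str]  # e.g. "a100", or None when no type is given
    count: int           # number of devices


def parse_gres(spec: str) -> GresSpec:
    """Parse a single Gres value of the form name:count or name:type:count."""
    parts = spec.split(":")
    if len(parts) == 2:
        return GresSpec(parts[0], None, int(parts[1]))
    if len(parts) == 3:
        return GresSpec(parts[0], parts[1], int(parts[2]))
    raise ValueError(f"unsupported Gres form: {spec!r}")


def only_type_changed(old: GresSpec, new: GresSpec) -> bool:
    """True when name and count match and only the type label differs."""
    return old.name == new.name and old.count == new.count and old.type != new.type


old = parse_gres("gpu:a100:5")
new = parse_gres("gpu:a100-sxm4-80gb:5")
print(only_type_changed(old, new))
```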

[slurm-users] Slurm + NVIDIA H100 + NVML Version

2023-08-21 Thread Matthias Leopold
Hi, not sure if this is the right place: Our Slurm 21.08 is compiled against NVML from CUDA 11.4 for "AutoDetect=nvml" support in gres.conf. Currently we use A100 GPU, I would like to know if we could use H100 GPU with this setup or if I need newer NVML (what version?). I didn't find anything

Re: [slurm-users] AllowGroups for Partition not working?

2023-07-06 Thread Matthias Leopold
On 05/07/2023 17:17, Matthias Leopold wrote: Thanks, but unfortunately that didn't help. Regards, Matthias On 05.07.23 at 17:59, Marko Markoc wrote: Hi Matthias, Before you start digging deeper into this, I would recommend restarting the `slurmctld` service. I've had simila

Re: [slurm-users] AllowGroups for Partition not working?

2023-07-05 Thread Matthias Leopold
't enough for certain configuration changes. Regards, Marko On Tue, Jul 4, 2023 at 3:57 AM Matthias Leopold <matthias.leop...@meduniwien.ac.at> wrote: Hi, I'm trying to use AllowGroups for partition configuration in my Slurm 21.08 cluster. Unexpectedly this

[slurm-users] AllowGroups for Partition not working?

2023-07-04 Thread Matthias Leopold
Hi, I'm trying to use AllowGroups for partition configuration in my Slurm 21.08 cluster. Unexpectedly this doesn't seem to work. My user can't submit jobs although he is a member of a group mentioned in AllowGroups: srun: error: Unable to allocate resources: User's group not permitted to use thi

Re: [slurm-users] Kernel keyrings on Slurm node inside Slurm job

2022-08-25 Thread Matthias Leopold
The Rachel and Selim Benin School of Computer Science and Engineering | The Hebrew University of Jerusalem | T +972-2-5494522 | F +972-2-5494522 | ir...@cs.huji.ac.il -- Matthias Leopold

[slurm-users] Kernel keyrings on Slurm node inside Slurm job

2022-08-23 Thread Matthias Leopold
Hi, I want to access the kernel "user" keyrings inside a Slurm job on an Ubuntu 20.04 node. I'm not an expert on keyrings (yet), I just discovered that inside a Slurm job a keyring for "user: invocation_id" is used, which seems to be shared across all users of the executing Slurm node (other u

[slurm-users] seff for NVIDIA GPU usage?

2022-06-07 Thread Matthias Leopold
Hi, I know this might be a too simple question for a bigger topic, but I'll just try: is there something like seff for measuring the efficiency of NVIDIA GPU usage in Slurm jobs? thx Matthias

Re: [slurm-users] addressing NVIDIA MIG + non MIG devices in Slurm - solved

2022-01-31 Thread Matthias Leopold
ives me everything I want, sorry for bothering you. Matthias On 27.01.22 at 16:27, Matthias Leopold wrote: Hi, we have 2 DGX A100 systems which we would like to use with Slurm. We want to use the MIG feature for _some_ of the GPUs. As I somehow suspected I couldn't find a working setup

Re: [slurm-users] addressing NVIDIA MIG + non MIG devices in Slurm - within one node

2022-01-27 Thread Matthias Leopold
devices. But there are downsides like no multi node MPI jobs and in general I still can't believe there is such a limitation. thx again for any feedback Matthias On 27.01.22 at 16:27, Matthias Leopold wrote: Hi, we have 2 DGX A100 systems which we would like to use with Slurm. We want t

[slurm-users] addressing NVIDIA MIG + non MIG devices in Slurm

2022-01-27 Thread Matthias Leopold
Hi, we have 2 DGX A100 systems which we would like to use with Slurm. We want to use the MIG feature for _some_ of the GPUs. As I somehow suspected I couldn't find a working setup for this in Slurm yet. I'll describe the configuration variants I tried after creating the MIG instances, it migh

Re: [slurm-users] Building Slurm with UCX support

2022-01-12 Thread Matthias Leopold
On 12.01.22 at 17:54, Matthias Leopold wrote: Hi, I'm compiling Slurm with ansible playbooks from NVIDIA deepops framework (https://github.com/NVIDIA/deepops). I'm trying to add UCX support. How can I tell if UCX is actually included in the resulting binaries (without actu

[slurm-users] Building Slurm with UCX support

2022-01-12 Thread Matthias Leopold
Hi, I'm compiling Slurm with ansible playbooks from NVIDIA deepops framework (https://github.com/NVIDIA/deepops). I'm trying to add UCX support. How can I tell if UCX is actually included in the resulting binaries (without actually using Slurm)? I was looking at executables and *so files with

[slurm-users] Specific limits over GRES - still relevant?

2021-07-01 Thread Matthias Leopold
Hi, I'm trying to prepare for using Slurm with DGX A100 systems with MIG configuration. I will have several gres:gpu types there so I tried to reproduce the situation described in "Specific limits over GRES" from https://slurm.schedmd.com/resource_limits.html, but I can't. In my test environ

Re: [slurm-users] limiting memory usage when submission doesn't specify memory requirements?

2021-04-23 Thread Matthias Leopold
at is expected behavior, but it would keep you from having to do something with a plugin. Jeff *From:* slurm-users on behalf of Matthias Leopold *Sent:* Thursday, April 22, 2021 5:13 AM *To:* Slurm User Community List *Su

[slurm-users] limiting memory usage when submission doesn't specify memory requirements?

2021-04-22 Thread Matthias Leopold
Hi, I'm testing how limiting memory resources works in Slurm. I'm using TaskPlugin=affinity,cgroup (slurm.conf) and ConstrainRAMSpace=yes (cgroup.conf) and have set a MaxMemPerCPU limit on the partition. To my surprise MaxMemPerCPU is enforced as long as the job submission requests a memory li

[slurm-users] Grp* Resource Limits on User Associations

2021-04-16 Thread Matthias Leopold
Hi, can someone please explain to me why it's possible to set Grp* resource limits on user associations? What's the use for this? As far as I understood the documentation, accounts can have children, but users cannot. I'm still a newbie exploring Slurm in a test environment, please excuse maybe stup

Re: [slurm-users] RawUsage 0??

2021-04-07 Thread Matthias Leopold
I had to do it and had no hints) Sorry for bothering you Matthias On 06.04.21 at 17:06, Matthias Leopold wrote: Hi, I'm very new to Slurm and try to understand basic concepts. One of them is the "Multifactor Priority Plugin". For this I submitted some jobs and looked at ss

[slurm-users] RawUsage 0??

2021-04-06 Thread Matthias Leopold
Hi, I'm very new to Slurm and try to understand basic concepts. One of them is the "Multifactor Priority Plugin". For this I submitted some jobs and looked at sshare output. To my surprise I don't get any numbers for "RawUsage"; regardless of what I do, RawUsage stays 0 (same in "scontrol show as