Re: [slurm-users] Usage gathering for GPUs
Hi all,

I'm trying to get the gathering of gres/gpumem and gres/gpuutil working on Slurm 23.02.2, but with no success yet. We have

AccountingStorageTRES=cpu,mem,gres/gpu

in slurm.conf, and Slurm is built with NVML support, with Autodetect=NVML in gres.conf. gres/gpumem and gres/gpuutil now appear in the sacct TRESUsageInAve record, but with zero values:

sacct -j 6056927_51 -Pno TRESUsageInAve
cpu=00:00:07,energy=0,fs/disk=14073059,gres/gpumem=0,gres/gpuutil=0,mem=6456K,pages=0,vmem=7052K
cpu=00:00:00,energy=0,fs/disk=2332,gres/gpumem=0,gres/gpuutil=0,mem=44K,pages=0,vmem=44K
cpu=05:18:51,energy=0,fs/disk=708800,gres/gpumem=0,gres/gpuutil=0,mem=2565376K,pages=0,vmem=2961244K

We are using NVIDIA Tesla V100 and A100 GPUs with driver version 530.30.02, and dcgm-exporter is working on the nodes. Is there anything else needed to get this working?

Thanks in advance,
Daniel Vecerka

On 24. 05. 23 21:45, Christopher Samuel wrote:
> On 5/24/23 11:39 am, Fulton, Ben wrote:
>> Hi,
>
> Hi Ben,
>
>> The release notes for 23.02 say "Added usage gathering for gpu/nvml (Nvidia) and gpu/rsmi (AMD) plugins". How would I go about enabling this?
>
> I can only comment on the nvidia side (as those are the GPUs we have), but for that you need Slurm built with NVML support and running with "Autodetect=NVML" in gres.conf, and then that information is stored in slurmdbd as part of the TRES usage data.
>
> For example, to grab a job step for a test code I ran the other day:
>
> csamuel@perlmutter:login01:~> sacct -j 9285567.0 -Pno TRESUsageInAve | tr , \\n | fgrep gpu
> gres/gpumem=493120K
> gres/gpuutil=76
>
> Hope that helps!
>
> All the best,
> Chris
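For anyone checking their own setup against this thread, a minimal sketch of the configuration pieces named above (the job ID is a placeholder, and whether anything beyond these settings is required is exactly the open question here):

    # slurm.conf (as posted above)
    AccountingStorageTRES=cpu,mem,gres/gpu

    # gres.conf -- slurmd must be built against and running with NVML
    Autodetect=NVML

    # Verify per-step GPU usage after a job finishes (job ID is illustrative)
    sacct -j <jobid> -Pno TRESUsageInAve | tr , '\n' | grep gres/gpu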
[slurm-users] Moving Jobs from local to remote Cluster
Hi,

Greetings of the day! I need your suggestions on the use case below.

I have two Slurm clusters pointing to the same database server. I submit jobs on the local cluster, and once the local cluster's resources are full I want to move the pending jobs to my remote cluster. Is there any way to achieve this? My intention is that, instead of users submitting jobs with the -M option, I want to do it on their behalf, and only for pending jobs, which would be moved to the remote cluster. The remote cluster is in the cloud, and I don't want users to submit jobs to it directly and exhaust the cloud resources while my local cluster's resources are not full.

I have not tried it, but would it solve my problem if I kept a separate database for each cluster?

Please advise on any methods or modifications that would achieve this.

Thanks,
Shaghuf
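Slurm has no built-in operation that migrates an already-queued job between clusters, so one possible approach -- a rough sketch only, with the cluster name "cloud" as a placeholder, and with no handling of submit-time command-line options, which are not preserved in the dumped script -- is an administrative script that resubmits pending batch jobs to the remote cluster:

    #!/bin/bash
    # Sketch: push locally pending batch jobs to a remote cluster.
    # Assumes both clusters share one slurmdbd; "cloud" is a placeholder.
    REMOTE=cloud

    for jobid in $(squeue -h -t PD -o "%i"); do
        # Dump the original batch script (batch jobs only; #SBATCH
        # directives survive, original command-line options do not)
        scontrol write batch_script "$jobid" "/tmp/job_${jobid}.sh" || continue
        # Submit to the remote cluster, then cancel the local copy
        sbatch -M "$REMOTE" "/tmp/job_${jobid}.sh" && scancel "$jobid"
    done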
[slurm-users] Cloud node utilization reporting
Hello,

I’ve got a problem that I’d imagine others have as well and am wondering how it is handled.

I produce periodic reports for my management showing, among other things, the overall “cluster utilization”, which we define as basically the ratio of CPU*Minutes allocated to CPU*Minutes available. It’s a simplistic but handy metric for projecting growth, among other things. Currently I grab this by running “sreport cluster utilization” and dividing “allocated” by “allocated + idle”, which gives us a pretty reasonable number.

However, we recently added some cloud-based partitions. I was hoping that idle nodes with state=CLOUD would not show up in this sreport output, but unfortunately they do. Our cloud partitions are almost never used (they are essentially for emergencies), but because they are quite large it has dropped the computed utilization enormously. Management is really only interested in the utilization of our on-prem components.

I can kludge this by manually subtracting out ( (number of CPUs in all cloud partitions) * (number of minutes in the reporting period) ), but that would require me to determine and add back in all allocated minutes for cloud jobs, keep track of intra-day changes to the partition sizes, etc.

Are others encountering similar problems? And if so, how do you resolve them?

--
Chip Seraphine
Grid Operations
For support please use help-grid in email or slack.
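A sketch of the manual correction described above, assuming a single cloud partition named "cloud" whose size did not change during the reporting period (both assumptions, and allocated minutes from cloud jobs are knowingly left in the numerator, which the original post flags as one of the weaknesses of this kludge):

    #!/bin/bash
    # Sketch: on-prem utilization, removing idle cloud capacity from
    # the denominator. Partition name "cloud" and a fixed partition
    # size over the period are assumptions.
    START=2023-06-01
    END=2023-07-01
    MINUTES=$(( 30 * 24 * 60 ))   # length of the reporting period

    read -r alloc idle <<< "$(sreport cluster utilization \
        start=$START end=$END -t minutes -nP format=Allocated,Idle | tr '|' ' ')"

    # Total CPUs in the cloud partition; %C prints alloc/idle/other/total
    cloud_cpus=$(sinfo -h -p cloud -o '%C' | cut -d/ -f4)

    echo "scale=4; $alloc / ($alloc + $idle - $cloud_cpus * $MINUTES)" | bc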
[slurm-users] Trying to update from slurm 19.05 to slurm 23.02 but I can't figure out how to allow users to reboot nodes...
I recently took over running a Slurm cluster that, among other things, allows users to reboot nodes in order to have them re-configured for different kinds of tests. This is accomplished through the RebootProgram config setting. But in the test environment I set up, we seem to have lost that capability:

[dlab04:~](N/A)$ srun -w skl01 -p sklcvl --reboot --pty /bin/bash
srun: error: rebooting of nodes is only allowed for admins
srun: error: Unable to allocate resources: Access/permission denied

I've gone through the man pages for slurm.conf but I can't find anything about how to define who the admins are. Is there still a way to do this with Slurm, or has the ability been removed?

Michael Heinz
End-to-End Network Software Engineer
michael.he...@intel.com
Re: [slurm-users] Trying to update from slurm 19.05 to slurm 23.02 but I can't figure out how to allow users to reboot nodes...
On 6/6/23 1:33 pm, Heinz, Michael wrote:

> I've gone through the man pages for slurm.conf but I can't find anything about how to define who the admins are. Is there still a way to do this with Slurm, or has the ability been removed?

Looks like that was disabled over 3 years ago.

commit dd111a52bf23d79efcfe9d5688e15cbc768bb22b
Author: Brian Christiansen
Date: Fri Jan 31 14:24:40 2020 -0700

    Disable sbatch, salloc, srun --reboot for non-admins

    Bug 7767

That bug is private it seems.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Trying to update from slurm 19.05 to slurm 23.02 but I can't figure out how to allow users to reboot nodes...
Yeah, looks like if we still want to do this I need to set up slurmdbd and an account database.

Sent from my iPad

> On Jun 6, 2023, at 5:07 PM, Christopher Samuel wrote:
>
> On 6/6/23 1:33 pm, Heinz, Michael wrote:
>
>> I've gone through the man pages for slurm.conf but I can't find anything about how to define who the admins are. Is there still a way to do this with Slurm, or has the ability been removed?
>
> Looks like that was disabled over 3 years ago.
>
> commit dd111a52bf23d79efcfe9d5688e15cbc768bb22b
> Author: Brian Christiansen
> Date: Fri Jan 31 14:24:40 2020 -0700
>
>    Disable sbatch, salloc, srun --reboot for non-admins
>
>    Bug 7767
>
> That bug is private it seems.
>
> All the best,
> Chris
> --
> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
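For reference, the AdminLevel that gates --reboot lives in the accounting database, so once slurmdbd is in place it would be set per user with sacctmgr -- a minimal sketch, where the user name is a placeholder:

    # Requires AccountingStorageType=accounting_storage/slurmdbd in
    # slurm.conf; "mheinz" is a placeholder user name.
    sacctmgr modify user name=mheinz set adminlevel=admin

    # Verify the change took effect
    sacctmgr show user mheinz format=User,AdminLevel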
[slurm-users] Billing/accounting for MIGs is not working
We have MIG defined and in use, but billing for which MIG profile was used doesn't seem to work. In slurm.conf the partitions have TRESBillingWeights along the lines of:

TRESBillingWeights=CPU=1,Mem=1G,GRES/gpu:3g.20gb=0.375,GRES/gpu:4g.20gb=0.5,GRES/gpu=1.0

Yet when I run sacct -j on a job that used a 3g.20gb or 4g.20gb MIG, AllocTRES only shows gres/gpu=1; I cannot see a smaller billing for the GPU.
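One thing worth checking -- a hedged suggestion, not a confirmed fix: typed GRES are only recorded per type in accounting when they are listed explicitly in AccountingStorageTRES, so the MIG profile names may need to be added there. A sketch, using the profile names from the weights above:

    # slurm.conf -- typed GRES must be listed to be recorded per type
    AccountingStorageTRES=cpu,mem,gres/gpu,gres/gpu:3g.20gb,gres/gpu:4g.20gb

    # Then inspect what accounting actually recorded for a finished job
    sacct -j <jobid> -o JobID,AllocTRES%80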