Re: [slurm-users] Usage gathering for GPUs

2023-06-06 Thread Vecerka Daniel

Hi all,

 I'm trying to get the gathering of gres/gpumem and gres/gpuutil working 
on Slurm 23.02.2, but with no success yet.


We have:
AccountingStorageTRES=cpu,mem,gres/gpu
in slurm.conf, and Slurm is built with NVML support.

Autodetect=NVML
in gres.conf

gres/gpumem and gres/gpuutil now appear in the sacct TRESUsageInAve record, 
but with zero values:


sacct -j 6056927_51 -Pno TRESUsageInAve

cpu=00:00:07,energy=0,fs/disk=14073059,gres/gpumem=0,gres/gpuutil=0,mem=6456K,pages=0,vmem=7052K
cpu=00:00:00,energy=0,fs/disk=2332,gres/gpumem=0,gres/gpuutil=0,mem=44K,pages=0,vmem=44K
cpu=05:18:51,energy=0,fs/disk=708800,gres/gpumem=0,gres/gpuutil=0,mem=2565376K,pages=0,vmem=2961244K

We are using NVIDIA Tesla V100 and A100 GPUs with driver version 
530.30.02. dcgm-exporter is working on the nodes.
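
(As a sanity check, the GRES configuration a node autodetects can be printed 
with something like "slurmd -G" on the node itself.)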


Is there anything else needed to get it working?

Thanks in advance.    Daniel Vecerka


On 24. 05. 23 21:45, Christopher Samuel wrote:

On 5/24/23 11:39 am, Fulton, Ben wrote:


Hi,


Hi Ben,

The release notes for 23.02 say “Added usage gathering for gpu/nvml 
(Nvidia) and gpu/rsmi (AMD) plugins”.


How would I go about enabling this?


I can only comment on the NVIDIA side (as those are the GPUs we have), but 
for that you need Slurm built with NVML support and running with 
"Autodetect=NVML" in gres.conf; that information is then stored in 
slurmdbd as part of the TRES usage data.


For example, to grab a job step for a test code I ran the other day:

csamuel@perlmutter:login01:~> sacct -j 9285567.0 -Pno TRESUsageInAve | tr , \\n | fgrep gpu

gres/gpumem=493120K
gres/gpuutil=76

Hope that helps!

All the best,
Chris




[slurm-users] Moving Job form local to remote Cluster

2023-06-06 Thread Shaghuf Rahman
Hi,

Greetings of the day!

I need your suggestions on the use case below.

I have 2 Slurm clusters pointing to the same database server. Jobs are
submitted to the local cluster, and once the local cluster's resources are
full I want to move the pending jobs to my remote cluster. Is there any way
to achieve this?
My intention is that, instead of users submitting jobs with the -M option
themselves, I would do it on their behalf, and only for the pending jobs that
need to be moved to the remote cluster. The remote cluster is in the cloud,
and I don't want users submitting to it directly and exhausting the cloud
resources while my local cluster still has capacity.
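
What I have in mind is something along these lines (an untested sketch;
"remote" is a placeholder for the remote cluster name, and it assumes the
batch scripts can be re-submitted unchanged):

#!/bin/bash
# Untested sketch: re-submit locally pending batch jobs on a remote cluster
# and cancel the local copies.
REMOTE=remote   # placeholder cluster name
for jobid in $(squeue --noheader --states=PENDING --format=%i); do
    # recover the job's batch script from the local controller
    scontrol write batch_script "$jobid" "/tmp/job_${jobid}.sh" || continue
    # re-submit on the remote cluster, then cancel the local pending copy
    if sbatch -M "$REMOTE" "/tmp/job_${jobid}.sh"; then
        scancel "$jobid"
    fi
done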

I have not tried it, but would keeping a separate database for each cluster
solve my problem?

Please suggest any methods or modifications that could achieve this.

Thanks,
Shaghuf


[slurm-users] Cloud node utilization reporting

2023-06-06 Thread Chip Seraphine
Hello,

I’ve got a problem that I’d imagine others have as well and am wondering how it 
is handled.

I produce periodic reports for my management showing, among other things, the 
overall “cluster utilization”, which we define as basically the ratio of 
CPU*Minutes allocated to CPU*Minutes available.   It’s a simplistic but handy 
metric for projecting growth, among other things.

Currently I grab this by running “sreport cluster utilization” and dividing 
“allocated” by “allocated + idle”, which gives us a pretty reasonable number.  
However, we recently added some cloud-based partitions.  I was hoping that idle 
nodes with state=CLOUD would not show up in this sreport output, but 
unfortunately they do. Our cloud partitions are almost never used (they are 
essentially for emergencies), but because they are quite large it has dropped 
the computed utilization enormously.   Management is really only interested in 
the utilization of our on-prem components.
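
Concretely, the calculation is along these lines (the dates are placeholders, 
and the exact field names may need checking against your sreport version):

sreport -n -P -t Hours cluster utilization start=2023-05-01 end=2023-06-01 format=Allocated,Idle \
    | awk -F'|' '{ printf "utilization = %.1f%%\n", 100*$1/($1+$2) }'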

I can kludge this by manually subtracting out the ( (number of CPUs in all 
cloud partitions) * (number of minutes in the reporting period) ), but that 
would require me to determine and add back in all allocated minutes for cloud 
jobs, keep track of intra-day changes to the partition sizes, etc.

Are others encountering similar problems?   And if so, how do you resolve them?

--

Chip Seraphine
Grid Operations
For support please use help-grid in email or slack.


[slurm-users] Trying to update from slurm 19.05 to slurm 23.02 but I can't figure out how to allow users to reboot nodes...

2023-06-06 Thread Heinz, Michael
I recently took over running a slurm cluster that among other things allows 
users to reboot nodes in order to have them re-configured for different kinds 
of tests. This is accomplished through the RebootProgram config setting.
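
(For reference, that is just a line in slurm.conf along the lines of the 
following; the actual reboot script is site-specific:)

RebootProgram=/usr/sbin/reboot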

But in the test environment I set up we seem to have lost that capability:

[dlab04:~](N/A)$ srun -w skl01 -p sklcvl --reboot --pty /bin/bash
srun: error: rebooting of nodes is only allowed for admins
srun: error: Unable to allocate resources: Access/permission denied

I've gone through the man pages for slurm.conf but I can't find anything about 
how to define who the admins are. Is there still a way to do this with Slurm, 
or has the ability been removed?

Michael Heinz
End-to-End Network Software Engineer
michael.he...@intel.com




Re: [slurm-users] Trying to update from slurm 19.05 to slurm 23.02 but I can't figure out how to allow users to reboot nodes...

2023-06-06 Thread Christopher Samuel

On 6/6/23 1:33 pm, Heinz, Michael wrote:


I've gone through the man pages for slurm.conf but I can't find anything about 
how to define who the admins are. Is there still a way to do this with Slurm, 
or has the ability been removed?


Looks like that was disabled over 3 years ago.

commit dd111a52bf23d79efcfe9d5688e15cbc768bb22b
Author: Brian Christiansen 
Date:   Fri Jan 31 14:24:40 2020 -0700

Disable sbatch, salloc, srun --reboot for non-admins

Bug 7767

That bug is private it seems.

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] Trying to update from slurm 19.05 to slurm 23.02 but I can't figure out how to allow users to reboot nodes...

2023-06-06 Thread Heinz, Michael
Yeah, it looks like if we still want to do this I need to set up slurmdbd and 
an accounting database.
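
If we go that route, I gather the relevant step would be something like the 
following once accounting is in place ("someuser" being a placeholder):

sacctmgr modify user where name=someuser set AdminLevel=Admin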

Sent from my iPad

> On Jun 6, 2023, at 5:07 PM, Christopher Samuel  wrote:
> 
> On 6/6/23 1:33 pm, Heinz, Michael wrote:
> 
>> I've gone through the man pages for slurm.conf but I can't find anything 
>> about how to define who the admins are. Is there still a way to do this with 
>> Slurm, or has the ability been removed?
> 
> Looks like that was disabled over 3 years ago.
> 
> commit dd111a52bf23d79efcfe9d5688e15cbc768bb22b
> Author: Brian Christiansen 
> Date:   Fri Jan 31 14:24:40 2020 -0700
> 
>Disable sbatch, salloc, srun --reboot for non-admins
> 
>Bug 7767
> 
> That bug is private it seems.
> 
> All the best,
> Chris
> -- 
> Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
> 
> 


[slurm-users] Billing/accounting for MIGs is not working

2023-06-06 Thread Richard Lefebvre
We have MIGs defined and in use, but billing for which MIG profile is used
doesn't seem to work.

In slurm.conf, the partition definitions have something like the line below
for TRESBillingWeights:
TRESBillingWeights=CPU=1,Mem=1G,GRES/gpu:3g.20gb=0.375,GRES/gpu:4g.20gb=0.5,GRES/gpu=1.0

Yet, when I run sacct -j on such a job, I don't see that a 3g.20gb or
4g.20gb MIG was used; I only see AllocTRES with gres/gpu=1, and I cannot see a
smaller billing value for the GPU.
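
(For reference, the kind of query I'm running is along these lines, with the 
real job ID elided:)

sacct -j <jobid> -o JobID,AllocTRES%80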