[slurm-users] GPU Accounting

2024-10-02 Thread Emyr James via slurm-users
We have a node with 8 H100 GPUs that are split into MIG instances. We are using cgroups. This seems to work fine. Users can do something like sbatch --gres="gpu:1g.10gb:1"... and the job starts on the node with the GPUs, and the CUDA visible devices and the PyTorch debug output show that the cgroup only g

[slurm-users] Re: Job Step State

2024-10-01 Thread Emyr James via slurm-users
? Regards, Emyr James Head of Scientific IT CRG - Centre for Genomic Regulation From: Emyr James via slurm-users Sent: 12 July 2024 11:51 To: slurm-users@lists.schedmd.com Subject: [slurm-users] Job Step State Dear all, I am working on a script to take

[slurm-users] Re: Nodes TRES double what is requested

2024-07-12 Thread Emyr James via slurm-users
Not sure if this is correct but I think you need to leave a bit of RAM for the OS to use so best not to allow slurm to allocate ALL of it. I usually take 8G off to allow for that - negligible when our nodes have at least 768GB of RAM. At least this is my experience when using cgroups. Emyr Jame
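
The headroom arithmetic described above can be sketched as follows. This is only an illustration, assuming the node's memory is declared through the RealMemory parameter in slurm.conf (which takes a value in megabytes); the helper name real_memory_mb and the use of the nominal 768 GB figure are hypothetical, and a real node typically reports somewhat less than its nominal RAM, so the value from `slurmd -C` or `free -m` is a better starting point.

```python
def real_memory_mb(total_gib: int, reserved_gib: int = 8) -> int:
    """Return a RealMemory-style value in MB after reserving headroom for the OS.

    total_gib    -- memory available on the node, in GiB (assumed nominal here)
    reserved_gib -- memory held back for the OS; 8 follows the post above
    """
    if reserved_gib >= total_gib:
        raise ValueError("reserved memory must be smaller than total memory")
    return (total_gib - reserved_gib) * 1024


if __name__ == "__main__":
    # A 768 GiB node with 8 GiB held back for the operating system.
    print(f"RealMemory={real_memory_mb(768)}")
```

The reserved 8 GiB is roughly 1% of a 768 GiB node, which is why the post calls it negligible.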

[slurm-users] Job Step State

2024-07-12 Thread Emyr James via slurm-users
Dear all, I am working on a script to take completed job accounting data from the slurm accounting database and insert the equivalent data into a ClickHouse table for fast reporting. I can see that all the information is included in the cluster_job_table and cluster_job_step_table which seem to

[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread Emyr James via slurm-users
? Presumably non-cgroup accounting has a similar issue? I.e. it polls RSS and then the accounting db reports the highest value seen, even though using getrusage and checking ru_maxrss should be done too? Many thanks, Emyr James Head of Scientific IT CRG - Centre for Genomic Regulation

[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread Emyr James via slurm-users
for Genomic Regulation From: Emyr James via slurm-users Sent: 20 May 2024 13:56 To: Thomas Green - Staff in University IT, Research Technologies / Staff Technoleg Gwybodaeth, Technolegau Ymchwil ; Davide DelVento Cc: slurm-users@lists.schedmd.com Subject: [slur

[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread Emyr James via slurm-users
.g. https://github.com/google/cadvisor/issues/3286 Tom From: Emyr James via slurm-users Date: Monday, 2

[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread Emyr James via slurm-users
ation available on this functionality? Emyr James Head of Scientific IT CRG - Centre for Genomic Regulation From: Emyr James via slurm-users Sent: 17 May 2024 01:15 To: Davide DelVento Cc: slurm-users@lists.schedmd.com Subject: [slurm-users] Re: memory high

[slurm-users] Re: memory high water mark reporting

2024-05-16 Thread Emyr James via slurm-users
(which also uses getrusage) or a variant you will be able to do that. On Thu, May 16, 2024 at 4:10 PM Emyr James via slurm-users <slurm-users@lists.schedmd.com> wrote: Hi, We are trying out slurm having been running grid engine for a long while. In grid engine, the cgroups peak memor

[slurm-users] memory high water mark reporting

2024-05-16 Thread Emyr James via slurm-users
Hi, We are trying out slurm having been running grid engine for a long while. In grid engine, the cgroups peak memory and max_rss are generated at the end of a job and recorded. It logs the information from the cgroup hierarchy as well as doing a getrusage call right at the end on the parent pid
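
The "getrusage call right at the end" idea mentioned above can be illustrated with a small wrapper. This is a minimal sketch, not how grid engine or slurm implement it: the function name run_and_report_peak_rss is hypothetical, and it relies on Python's standard resource module. On Linux, ru_maxrss is reported in kilobytes, and for RUSAGE_CHILDREN it is the peak of the largest waited-for child rather than a sum over children.

```python
import resource
import subprocess
import sys


def run_and_report_peak_rss(cmd):
    """Run cmd to completion and return (exit_code, peak_rss).

    peak_rss comes from getrusage(RUSAGE_CHILDREN) taken after the child
    has been waited for, so it is a true high-water mark rather than the
    highest value seen by periodic polling. Units are KiB on Linux.
    """
    completed = subprocess.run(cmd, check=False)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return completed.returncode, usage.ru_maxrss


if __name__ == "__main__":
    # Hypothetical workload: a child that briefly allocates about 50 MiB.
    child = [sys.executable, "-c", "data = b'x' * (50 * 1024 * 1024)"]
    code, peak = run_and_report_peak_rss(child)
    print(f"exit={code} peak_rss={peak} KiB")
```

Because the value is read once after the process exits, a short-lived memory spike between polling intervals is still captured, which is the behaviour the post contrasts with polled RSS accounting.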