Hi,

No changes. My example used /tmp, but the behaviour is the same for copies 
between any filesystems (e.g. from a distributed fs to another distributed fs).

Guillaume

----- Original Message -----
From: "Stijn De Weirdt via slurm-users" <slurm-users@lists.schedmd.com>
To: slurm-users@lists.schedmd.com
Sent: Thursday, May 22, 2025 11:33:18
Subject: [slurm-users] Re: Wrong MaxRSS Behavior with cgroup v2 in Slurm

hi guillaume,

is nothing else different between the v1 and v2 setups? (perhaps /tmp is 
tmpfs on the v2 setup?)


stijn

On 5/22/25 11:10, Guillaume COCHARD via slurm-users wrote:
> Hello,
> 
> We've noticed a recent change in how MaxRSS is reported on our cluster. 
> Specifically, the MaxRSS value for many jobs now often matches the allocated 
> memory, which was not the case previously. It appears this change is due to 
> how Slurm accounts for memory when copying large files, likely as a result of 
> moving from cgroup v1 to cgroup v2.
> 
> Here’s a simple example:
> 
> copy_file.sh
> #!/bin/bash
> cp /distributed/filesystem/file5G /tmp
> cp /tmp/file5G ~
> 
> Two jobs with different memory allocations:
> 
> Job 1
> sbatch -c 1 --mem=1G copy_file.sh
> seff <jobid>
> Memory Utilized: 1021.87 MB
> Memory Efficiency: 99.79% of 1.00 GB
> 
> Job 2
> sbatch -c 1 --mem=10G copy_file.sh
> seff <jobid>
> Memory Utilized: 4.02 GB
> Memory Efficiency: 40.21% of 10.00 GB
> 
> With cgroup v1, this script typically showed minimal memory usage. Now, under 
> cgroup v2, memory usage appears inflated and depends on the allocated memory, 
> which seems wrong.
> 
> I believe this behavior aligns with similar issues raised by the Kubernetes 
> community [1], and is consistent with how memory.current behaves in cgroup v2 
> [3].
> 
> According to Slurm’s documentation about cgroup v2, "this plugin provides 
> cgroup's memory.current value from the memory interface, which is not equal 
> to the RSS value provided by procfs. Nevertheless it is the same value that 
> the kernel uses in its OOM killer logic." [2]
> 
> While technically correct, this seems to mark a significant change in what 
> MaxRSS and "Memory Efficiency" actually measure and renders those metrics 
> almost useless.
> 
> Our Configuration:
> ProctrackType=proctrack/cgroup
> TaskPlugin=task/cgroup,task/affinity
> 
> Question:
> Is there a way to restore more realistic MaxRSS values — specifically, ones 
> that exclude file-backed page cache — while still using cgroup v2?
> 
> Thanks,
> Guillaume
> 
> References:
> 
> [1] https://github.com/kubernetes/kubernetes/issues/118916
> [2] https://slurm.schedmd.com/cgroup_v2.html#limitations
> [3] https://facebookmicrosites.github.io/cgroup2/docs/memory-controller.html
> 
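(For what it's worth: the kernel does expose an RSS-like figure separately. In cgroup v2, memory.current lumps together anonymous memory and file-backed page cache, but memory.stat breaks them apart in its "anon" and "file" counters. A minimal sketch of reading that split, using made-up sample values resembling a job that just cached a large file; on a real node you would cat the job cgroup's memory.stat instead:)

```shell
# Sample memory.stat contents (made-up values); on a real node, read
# /sys/fs/cgroup/<job cgroup path>/memory.stat instead.
stat='anon 52428800
file 4294967296
kernel_stack 163840'

# "anon" is close to classic RSS; "file" is the page cache that
# inflates memory.current during large copies.
anon=$(printf '%s\n' "$stat" | awk '$1 == "anon" {print $2}')
file=$(printf '%s\n' "$stat" | awk '$1 == "file" {print $2}')
echo "anon=${anon} file_cache=${file}"
```

(Here anon is ~50 MiB of real resident memory while file is ~4 GiB of reclaimable cache, which matches the pattern Guillaume describes: memory.current, and hence MaxRSS, tracks anon + file rather than anon alone.)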


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
