Hello guys,
I'd like to ask for tips on GPU resource sharing with Slurm. I have multiple
GPUs in my cluster and multiple users who submit jobs as Slurm batch jobs.
However, GPU resource usage depends on what each job is doing and is uneven,
so some jobs don't use the GPU (a little of
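For what it's worth, a GRES-based setup is usually the starting point for sharing GPUs
between batch jobs, so Slurm tracks which GPUs are allocated and each job only sees its
own devices. A minimal sketch, assuming an NVIDIA node called gpu-node01 with two GPUs
(node name, GPU count and device paths are placeholders, not your actual config):

# slurm.conf
GresTypes=gpu
NodeName=gpu-node01 Gres=gpu:2 State=UNKNOWN

# gres.conf on gpu-node01
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1

# cgroup.conf, so a job only sees the GPU(s) it was allocated
ConstrainDevices=yes

# in the batch script
#SBATCH --gres=gpu:1

Users then request exactly the GPUs they need with --gres (or --gpus in newer Slurm
releases), and jobs that don't request a GPU won't tie one up.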
Jean-mathieu CHANTREIN writes:
> I tried using, in slurm.conf
> TaskPlugin=task/affinity, task/cgroup
> SelectTypeParameters=CR_CPU_Memory
> MemLimitEnforce=yes
>
> and in cgroup.conf:
> CgroupAutomount=yes
> ConstrainCores=yes
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
MaxSwapPercent=10
Our cgroup settings are quite a bit different, and we don’t allow jobs to swap,
but the following works to limit memory here (I know, because I get frequent
emails from users who don’t change their jobs from the default 2 GB per CPU
that we use):
CgroupMountpoint="/sys/fs/cgroup"
CgroupA
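In case the full combination is useful, here is a sketch of that kind of setup
(the values are examples/assumptions, not our exact production config):

# cgroup.conf: pin jobs to their cores and memory, no swap
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedSwapSpace=0

# slurm.conf: schedule on CPU+memory and set the 2 GB/CPU default
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
DefMemPerCPU=2048

With both RAM and swap constrained, a job that exceeds its request gets OOM-killed
by the cgroup rather than pushed into swap.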
Hello,
I tried using, in slurm.conf
TaskPlugin=task/affinity, task/cgroup
SelectTypeParameters=CR_CPU_Memory
MemLimitEnforce=yes
and in cgroup.conf:
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
MaxSwapPercent=10
TaskAffinity=no
But when the job
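If the question is whether the limit is actually enforced at run time, one quick
check (just a sketch; it assumes python3 is on the compute node, but any
memory-hungry command works) is to request a small allocation and deliberately
exceed it:

$ srun --mem=100 python3 -c 'x = bytearray(500 * 1024 * 1024)'

With ConstrainRAMSpace=yes (and swap constrained) the step should be killed by the
cgroup OOM handler; if it runs to completion, the cgroup limits are not being applied.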
Hi Marcus,
I had not restarted munge previously.
So I restarted munge followed by slurmd, but the issue still persists.
I ran the following test from piglet-17 to verify the munge installation,
and it looks good.
$ munge -n | unmunge
STATUS: Success (0)
ENCODE_HOST: piglet-17.sg.cor
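It may also be worth decoding a credential on the other nodes, not just locally,
since a local munge | unmunge can succeed even when keys or clocks differ between
hosts; for example:

$ munge -n | ssh piglet-18 unmunge
$ munge -n | ssh piglet-19 unmunge

A "STATUS: Success (0)" on the remote side would confirm that all nodes share the
same munge.key and have their clocks in sync.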
Hmm, that is strange. I asked because of the errors below:
On 10/7/19 9:36 AM, Eddy Swan wrote:
[2019-10-07T13:38:49.260] error: slurm_cred_create: getpwuid failed for uid=1000
[2019-10-07T13:38:49.260] error: slurm_cred_create error
and "id" uses the same call (ltrace excerpt):
getpwuid(0x9
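For anyone following along, that check can be reproduced with something like the
following (ltrace output format varies a bit by distro):

$ ltrace id 1000 2>&1 | grep getpwuid    # the same libc call slurmd is failing on
$ getent passwd 1000                     # same lookup through the full NSS path

If getent resolves uid 1000 on the submit host but not on the node where slurmd
logs the error, the node's nsswitch.conf / NSS backend is the first thing to compare.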
Hi Marcus,
piglet-17 as submit host:
$ id 1000
uid=1000(turing) gid=1000(turing)
groups=1000(turing),10(wheel),991(vboxusers)
piglet-18:
$ id 1000
uid=1000(turing) gid=1000(turing)
groups=1000(turing),10(wheel),992(vboxusers)
uid 1000 is a local user on each node (piglet-17~19).
I also tried to
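A quick way to compare that account on all three nodes in one go (assuming ssh
between them works) would be something like:

$ for h in piglet-17 piglet-18 piglet-19; do
>   echo "== $h"; ssh "$h" 'getent passwd 1000; getent group 1000'
> done

If all three resolve the uid the same way, the differing vboxusers gid is probably
harmless, and the getpwuid failure is more about what the slurmd process itself can
see (NSS caches such as nscd, for instance) than about the accounts themselves.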