[slurm-users] How to share GPU resources? (MPS or another way?)

2019-10-07 Thread Kota Tsuyuzaki
Hello guys, I'd like to ask for tips on sharing GPU resources with Slurm. I have multiple GPUs in my cluster and multiple users who spawn jobs as Slurm batch jobs. However, GPU resource usage depends on what each job is doing and is uneven, so some jobs don't use the GPU (a little of
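Slurm 19.05+ can share a GPU between jobs through its CUDA MPS GRES support; a minimal sketch, where the node name "node01" and the share count of 100 are illustrative assumptions, not taken from the thread:

    # slurm.conf
    GresTypes=gpu,mps
    NodeName=node01 Gres=gpu:1,mps:100

    # gres.conf on node01
    Name=gpu File=/dev/nvidia0
    Name=mps Count=100

    # a job then requests a fraction of the GPU, e.g. half of it;
    # Slurm maps the share onto CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
    sbatch --gres=mps:50 job.sh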

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-07 Thread Bjørn-Helge Mevik
Jean-mathieu CHANTREIN writes:
> I tried using, in slurm.conf
> TaskPlugin=task/affinity, task/cgroup
> SelectTypeParameters=CR_CPU_Memory
> MemLimitEnforce=yes
>
> and in cgroup.conf:
> CgroupAutomount=yes
> ConstrainCores=yes
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
> MaxSwapPe

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-07 Thread Renfro, Michael
Our cgroup settings are quite a bit different, and we don’t allow jobs to swap, but the following works to limit memory here (I know, because I get frequent emails from users who don’t change their jobs from the default 2 GB per CPU that we use):

CgroupMountpoint="/sys/fs/cgroup"
CgroupA
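For comparison, a minimal cgroup.conf along those lines might look like the sketch below; since the full settings are cut off above, the exact values here are assumptions, not Michael's actual configuration:

    # cgroup.conf
    CgroupMountpoint="/sys/fs/cgroup"
    CgroupAutomount=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes
    AllowedSwapSpace=0    # no swap headroom, so jobs cannot swap

    # slurm.conf
    TaskPlugin=task/cgroup
    SelectTypeParameters=CR_CPU_Memory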

[slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-07 Thread Jean-mathieu CHANTREIN
Hello, I tried using, in slurm.conf:

TaskPlugin=task/affinity, task/cgroup
SelectTypeParameters=CR_CPU_Memory
MemLimitEnforce=yes

and in cgroup.conf:

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
MaxSwapPercent=10
TaskAffinity=no

But when the job
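One way to check whether the limit is actually enforced is to submit a job that deliberately over-allocates memory; a hypothetical test, assuming python3 is available on the compute nodes:

    #!/bin/bash
    # test-mem.sh: try to hold ~2 GiB of RAM
    python3 -c 'a = "x" * (2 * 1024**3)'

    $ sbatch --mem=1G test-mem.sh

With working cgroup limits the job should be killed by the OOM killer; if it runs to completion, memory is not being constrained.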

Re: [slurm-users] srun: Error generating job credential

2019-10-07 Thread Eddy Swan
Hi Marcus, I had not restarted munge previously. So I restarted munge followed by slurmd, but the issue still persists. I ran the following test on piglet-17 to verify the munge installation; it looks good.

$ munge -n | unmunge
STATUS: Success (0)
ENCODE_HOST: piglet-17.sg.cor
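A local munge test only proves the local daemon works; the usual next step (assuming SSH access between the nodes) is to encode on one host and decode on another, in both directions:

    # both commands should report STATUS: Success (0)
    # if the munge keys and clocks match across nodes
    munge -n | ssh piglet-18 unmunge
    ssh piglet-18 munge -n | unmunge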

Re: [slurm-users] srun: Error generating job credential

2019-10-07 Thread Marcus Wagner
Hmm, that is strange. I asked because of the errors below:

On 10/7/19 9:36 AM, Eddy Swan wrote:
[2019-10-07T13:38:49.260] error: slurm_cred_create: getpwuid failed for uid=1000
[2019-10-07T13:38:49.260] error: slurm_cred_create error

and "id" uses the same call (ltrace excerpt): getpwuid(0x9
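Since slurm_cred_create() calls getpwuid(), a quick check is whether the same lookup succeeds through NSS on the host that logs the error; a sketch:

    # should print a passwd entry for uid 1000; no output means
    # NSS (files/ldap/sssd) cannot resolve the user on this host
    getent passwd 1000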

Re: [slurm-users] srun: Error generating job credential

2019-10-07 Thread Eddy Swan
Hi Marcus, piglet-17 as submit host:

$ id 1000
uid=1000(turing) gid=1000(turing) groups=1000(turing),10(wheel),991(vboxusers)

piglet-18:

$ id 1000
uid=1000(turing) gid=1000(turing) groups=1000(turing),10(wheel),992(vboxusers)

id 1000 is a local user on each node (piglet-17~19). I also tried to
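With local users on every node, it is worth confirming that all hosts, including the controller, resolve the uid identically; a hypothetical one-liner, assuming SSH access to each node:

    # compare uid/gid resolution across the cluster
    for h in piglet-17 piglet-18 piglet-19; do
        echo "== $h =="; ssh "$h" id 1000
    done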