[slurm-users] How to share GPU resources? (MPS or another way?)

2019-10-07 Thread Kota Tsuyuzaki
Hello guys, I'd like to ask for tips on sharing GPU resources with Slurm. I have multiple GPUs in my cluster and multiple users who spawn jobs as Slurm batch jobs. However, GPU resource usage depends on what each job is doing and is uneven, so some jobs don't use the GPU (a little of
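Slurm 19.05+ can share a GPU between jobs through its CUDA MPS GRES support; a minimal sketch, where the node name "node01" and the share count of 100 are illustrative assumptions, not taken from the thread:

    # slurm.conf
    GresTypes=gpu,mps
    NodeName=node01 Gres=gpu:1,mps:100

    # gres.conf on node01
    Name=gpu File=/dev/nvidia0
    Name=mps Count=100

    # a job then requests a fraction of the GPU, e.g. half of it;
    # Slurm maps the share onto CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
    sbatch --gres=mps:50 job.sh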

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-07 Thread Bjørn-Helge Mevik
Jean-mathieu CHANTREIN writes:
> I tried using, in slurm.conf
> TaskPlugin=task/affinity, task/cgroup
> SelectTypeParameters=CR_CPU_Memory
> MemLimitEnforce=yes
>
> and in cgroup.conf:
> CgroupAutomount=yes
> ConstrainCores=yes
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
> MaxSwapPe

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-07 Thread Renfro, Michael
Our cgroup settings are quite a bit different, and we don’t allow jobs to swap, but the following works to limit memory here (I know, because I get frequent emails from users who don’t change their jobs from the default 2 GB per CPU that we use):

CgroupMountpoint="/sys/fs/cgroup"
CgroupA
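For comparison, a minimal cgroup.conf along those lines might look like the sketch below; since the full settings are cut off above, the exact values here are assumptions, not Michael's actual configuration:

    # cgroup.conf
    CgroupMountpoint="/sys/fs/cgroup"
    CgroupAutomount=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes
    AllowedSwapSpace=0    # no swap headroom, so jobs cannot swap

    # slurm.conf
    TaskPlugin=task/cgroup
    SelectTypeParameters=CR_CPU_Memory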

[slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-07 Thread Jean-mathieu CHANTREIN
Hello, I tried using, in slurm.conf:

TaskPlugin=task/affinity, task/cgroup
SelectTypeParameters=CR_CPU_Memory
MemLimitEnforce=yes

and in cgroup.conf:

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
MaxSwapPercent=10
TaskAffinity=no

But when the job
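One way to check whether the limit is actually enforced is to submit a job that deliberately over-allocates memory; a hypothetical test, assuming python3 is available on the compute nodes:

    #!/bin/bash
    # test-mem.sh: try to hold ~2 GiB of RAM
    python3 -c 'a = "x" * (2 * 1024**3)'

    $ sbatch --mem=1G test-mem.sh

With working cgroup limits the job should be killed by the OOM killer; if it runs to completion, memory is not being constrained.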

Re: [slurm-users] srun: Error generating job credential

2019-10-07 Thread Eddy Swan
Hi Marcus, I had not restarted munge previously. So I restarted munge followed by slurmd, but the issue still persists. I ran the following test on piglet-17 to verify the munge installation; it looks good.

$ munge -n | unmunge
STATUS: Success (0)
ENCODE_HOST: piglet-17.sg.cor
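A local munge test only proves the local daemon works; the usual next step (assuming SSH access between the nodes) is to encode on one host and decode on another, in both directions:

    # both commands should report STATUS: Success (0)
    # if the munge keys and clocks match across nodes
    munge -n | ssh piglet-18 unmunge
    ssh piglet-18 munge -n | unmunge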

Re: [slurm-users] srun: Error generating job credential

2019-10-07 Thread Marcus Wagner
Hmm, that is strange. I asked because of the errors below:

On 10/7/19 9:36 AM, Eddy Swan wrote:
[2019-10-07T13:38:49.260] error: slurm_cred_create: getpwuid failed for uid=1000
[2019-10-07T13:38:49.260] error: slurm_cred_create error

and "id" uses the same call (ltrace excerpt): getpwuid(0x9
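Since slurm_cred_create() calls getpwuid(), a quick check is whether the same lookup succeeds through NSS on the host that logs the error; a sketch:

    # should print a passwd entry for uid 1000; no output means
    # NSS (files/ldap/sssd) cannot resolve the user on this host
    getent passwd 1000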

Re: [slurm-users] srun: Error generating job credential

2019-10-07 Thread Eddy Swan
Hi Marcus, piglet-17 as submit host:

$ id 1000
uid=1000(turing) gid=1000(turing) groups=1000(turing),10(wheel),991(vboxusers)

piglet-18:

$ id 1000
uid=1000(turing) gid=1000(turing) groups=1000(turing),10(wheel),992(vboxusers)

id 1000 is a local user on each node (piglet-17~19). I also tried to
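With local users on every node, it is worth confirming that all hosts, including the controller, resolve the uid identically; a hypothetical one-liner, assuming SSH access to each node:

    # compare uid/gid resolution across the cluster
    for h in piglet-17 piglet-18 piglet-19; do
        echo "== $h =="; ssh "$h" id 1000
    done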