Re: [slurm-users] Running job using our serial queue

2019-11-07 Thread David Baker
Hi Marcus, Thank you for your reply. Your comments regarding the oom_killer sound interesting. Looking at the slurmd logs on the serial nodes, I see that the oom_killer is very active on a typical day, so I suspect you're on to something there. As you might expect memory is configure

[slurm-users] oom-kill events for no good reason

2019-11-07 Thread David Baker
Hello, We are dealing with a weird issue on our shared nodes where jobs appear to be stalling for some reason. I was advised that this issue might be related to the oom-killer process. We do see a lot of these events. In fact when I started to take a closer look this afternoon I noticed that

Re: [slurm-users] oom-kill events for no good reason

2019-11-07 Thread Christopher Samuel
On 11/7/19 8:36 AM, David Baker wrote: We are dealing with a weird issue on our shared nodes where jobs appear to be stalling for some reason. I was advised that this issue might be related to the oom-killer process. We do see a lot of these events. In fact when I started to take a closer lo

[slurm-users] Scheduling GPUS

2019-11-07 Thread Mike Mosley
Greetings all: I'm attempting to configure the scheduler to schedule our GPU boxes but have run into a bit of a snag. I have a box with two Tesla K80s. With my current configuration, the scheduler will schedule one job on the box, but if I submit a second job, it queues up until the first one f

Re: [slurm-users] Scheduling GPUS

2019-11-07 Thread Alex Chekholko
Hi Mike, IIRC if you have the default config, jobs get all the memory in the node, thus you can only run one job at a time. Check:
    root@admin:~# scontrol show config | grep DefMemPerNode
    DefMemPerNode = 64000
Regards, Alex
On Thu, Nov 7, 2019 at 1:21 PM Mike Mosley wrote: > Greetings
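The check Alex describes can also be scripted. The following is a minimal sketch, not part of the original thread: it assumes the `scontrol show config` output has already been captured as text, and the function name `parse_def_mem_per_node` is hypothetical. The value is returned as a string, since it may not be numeric on every system.

```python
import re
from typing import Optional


def parse_def_mem_per_node(config_text: str) -> Optional[str]:
    """Return the DefMemPerNode value found in `scontrol show config`-style
    text, or None if no such line is present.

    Hypothetical helper for illustration; the value is kept as a string
    because it is not guaranteed to be a number.
    """
    match = re.search(
        r"^\s*DefMemPerNode\s*=\s*(\S+)", config_text, flags=re.MULTILINE
    )
    return match.group(1) if match else None


# Sample input: the single line shown in Alex's reply.
sample = "DefMemPerNode = 64000\n"
value = parse_def_mem_per_node(sample)
print("DefMemPerNode:", value)
```

If the value covers most or all of a node's memory, a second job that does not request less memory explicitly would have to wait, which is consistent with the behaviour Mike reports.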

[slurm-users] Limiting the number of CPU

2019-11-07 Thread Sukman
Hi all, I am currently having a problem limiting the number of CPUs used for running a job. I tried to limit the CPUs to just 2 out of the maximum of 56. But when I run the job, using only 1 CPU, the QOS limit has already been reached. When I set the limit to 56, the job runs fine. Does anyone have