Hi Marcus,
Thank you for your reply. Your comments regarding the oom_killer sound interesting. Looking at the slurmd logs on the serial nodes, I see that the oom_killer is very active on a typical day, so I suspect you're likely on to something there. As you might expect, memory is configured ...
Hello,
We are dealing with a weird issue on our shared nodes where jobs appear to be stalling for some reason. I was advised that this issue might be related to the oom-killer process. We do see a lot of these events. In fact, when I started to take a closer look this afternoon, I noticed that ...
Greetings all:
I'm attempting to configure the scheduler to schedule our GPU boxes but have run into a bit of a snag.
I have a box with two Tesla K80s. With my current configuration, the scheduler will schedule one job on the box, but if I submit a second job, it queues up until the first one finishes.
Hi Mike,
IIRC if you have the default config, jobs get all the memory in the node,
thus you can only run one job at a time. Check:
root@admin:~# scontrol show config | grep DefMemPerNode
DefMemPerNode = 64000
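If you want two jobs to share the box, one option (numbers purely illustrative, and assuming the K80s are already defined as gres/gpu) is a per-CPU memory default instead of DefMemPerNode, e.g. in slurm.conf:

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
DefMemPerCPU=4000

and then submit along the lines of:

sbatch --gres=gpu:1 --mem-per-cpu=4000 job.sh

with a restart of the Slurm daemons to pick up the change.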
Regards,
Alex
On Thu, Nov 7, 2019 at 1:21 PM Mike Mosley wrote:
> Greetings
Hi all,
I am currently having a problem limiting the number of CPUs used for a job.
I tried to limit the CPUs to just 2 out of the maximum of 56.
But when I run the job using only 1 CPU, the QOS limit is reported as already reached.
When I set the CPUs to 56, the job runs fine.
Does anyone have any suggestions?
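In case it helps, I can pull the relevant limits and what the job actually requests with something like the following (QOS name and job id to be filled in) and post the output:

sacctmgr show qos format=Name,MaxTRES,MaxTRESPU,GrpTRES
scontrol show job <jobid> | grep -i tres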