[slurm-users] Need to free up memory for running more than one job on a node

2023-06-16 Thread Joe Waliga

Hello,

(This is my first time submitting a question to the list)

We have a test-HPC with 1 login node and 2 compute nodes. When we 
submit 90 jobs onto the test-HPC, we can only run one job per node. We 
seem to be allocating all memory to the one job, and other jobs can't 
run until the memory is freed up.


Any ideas on what we need to change in order to free up the memory?

~ ~

We noticed this from the 'slurmctld.log' ...

[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: 
Not considering node hpc2-comp01, allocated memory = 1 and all memory 
requested for JobId=71_*
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: 
Not considering node hpc2-comp02, allocated memory = 1 and all memory 
requested for JobId=71_*


The test-HPC is running on hardware, but we also created a test-HPC 
using a 3-VM set built with Vagrant on a VirtualBox backend.


I have included some of the 'slurmctld.log' file, the batch submission 
script, the slurm.conf file (of the hardware-based test-HPC), and the 
'Vagrantfile' (in case someone wants to recreate our test-HPC in a set 
of VMs).


- Joe


- (some of) slurmctld.log -

[2023-06-15T20:11:32.631] select/cons_tres: node_data_dump: 
Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 
ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 
AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.631] select/cons_tres: node_data_dump: 
Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 
ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 
AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.631] debug3: select/cons_tres: _verify_node_state: 
Not considering node hpc2-comp01, allocated memory = 1 and all memory 
requested for JobId=71_7(71)
[2023-06-15T20:11:32.631] debug3: select/cons_tres: _verify_node_state: 
Not considering node hpc2-comp02, allocated memory = 1 and all memory 
requested for JobId=71_7(71)
[2023-06-15T20:11:32.631] select/cons_tres: _job_test: SELECT_TYPE: 
evaluating JobId=71_7(71) on 0 nodes
[2023-06-15T20:11:32.631] select/cons_tres: _job_test: SELECT_TYPE: test 
0 fail: insufficient resources
[2023-06-15T20:11:32.631] select/cons_tres: common_job_test: no 
job_resources info for JobId=71_7(71) rc=-1
[2023-06-15T20:11:32.631] debug2: select/cons_tres: select_p_job_test: 
evaluating JobId=71_7(71)
[2023-06-15T20:11:32.631] select/cons_tres: common_job_test: 
JobId=71_7(71) node_mode:Normal alloc_mode:Test_Only
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list & 
exc_cores
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
node_list:hpc2-comp[01-02]
[2023-06-15T20:11:32.632] select/cons_tres: common_job_test: nodes: 
min:1 max:1 requested:1 avail:2
[2023-06-15T20:11:32.632] select/cons_tres: node_data_dump: 
Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 
ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 
AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.632] select/cons_tres: node_data_dump: 
Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 
ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 
AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.632] select/cons_tres: _job_test: SELECT_TYPE: 
evaluating JobId=71_7(71) on 2 nodes
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
_select_nodes/enter
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
node_list:hpc2-comp[01-02]
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
core_list:node[0]:0-15,node[1]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: can_job_run_on_node: 
SELECT_TYPE: 32 CPUs on hpc2-comp01(state:1), mem 1/1
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: 
Node:hpc2-comp01 Sockets:2 SpecThreads:0 CPUs:Min-Max,Avail:2-32,32 
ThreadsPerCore:2
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: 
  Socket[0] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: 
  Socket[1] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: can_job_run_on_node: 
SELECT_TYPE: 32 CPUs on hpc2-comp02(state:1), mem 1/1
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: 
Node:hpc2-comp02 Sockets:2 SpecThreads:0 CPUs:Min-Max,Avail:2-32,32 
ThreadsPerCore:2
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: 
  Socket[0] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: 
  Socket[1] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
_select_nodes/elim_nodes
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
node_list:hpc2-comp[01-02]
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
core_list:node[0]:0-15,node[1]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: _eval_nodes: set:0 consec 
CPUs:64 nodes:2:hpc2-comp[01-02] begin:

Re: [slurm-users] Need to free up memory for running more than one job on a node

2023-06-16 Thread Markuske, William
Hello Joe,

You haven't defined any memory allocation or oversubscription in your 
slurm.conf, so by default Slurm gives a full node's worth of memory to each 
job. There are several ways to handle this, but what you probably want to do 
is make both CPU and memory consumable resources with the parameter:

SelectTypeParameters=CR_CPU_Memory

Then you'll want to define the amount of memory (in megabytes) on each node 
as part of its NodeName definition with

RealMemory=

Lastly, you'll need to define a default memory allocation (in megabytes) per 
job, typically as memory per CPU, with

DefMemPerCPU=
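
Putting those three pieces together, a minimal sketch of the relevant 
slurm.conf lines might look like the following. The node names and CPU layout 
are taken from your log; the 64 GB RealMemory and 2000 MB DefMemPerCPU are 
made-up values, so substitute whatever 'slurmd -C' reports on your compute 
nodes and whatever default suits your jobs:

# sketch only -- replace RealMemory/DefMemPerCPU with your real numbers
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory
DefMemPerCPU=2000
NodeName=hpc2-comp[01-02] Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64000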

With those changes, a job that doesn't request memory explicitly gets 
#CPUs x DefMemPerCPU by default. You can then use the --mem or --mem-per-cpu 
flags to request more or less memory for a job.
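
For instance (hypothetical script name and sizes), a job could request memory 
either way:

sbatch --mem=4G myjob.sh                 # 4 GB total for the job
sbatch --mem-per-cpu=2G -n 4 myjob.sh    # 2 GB per CPU, 8 GB total for 4 tasks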

There's also oversubscription, where you allow jobs to use more memory than is 
available on the node. Then you don't technically need to define the memory for 
a job, but you run into the issue that a single job could use all of it and 
trigger OOM errors on the nodes.
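
One way to get that behavior (just a sketch of one reading of it, not 
necessarily what you want) is to leave memory out of the consumable resources 
so Slurm simply doesn't track it:

# sketch: memory is not a consumable resource here, so jobs can
# collectively exceed the node's RAM and risk the OOM killer
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU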

Regards,

--
Willy Markuske

HPC Systems Engineer
MS Data Science and Engineering
SDSC - Research Data Services
(619) 519-4435
wmarku...@sdsc.edu
