[slurm-users] Need to free up memory for running more than one job on a node
Hello,

(This is my first time submitting a question to the list.)

We have a test-HPC with 1 login node and 2 compute nodes. When we submit 90 jobs onto the test-HPC, we can only run one job per node. We seem to be allocating all memory to the one job, and other jobs can't run until the memory is freed up. Any ideas on what we need to change in order to free up the memory?

We noticed this from the 'slurmctld.log' ...

[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp01, allocated memory = 1 and all memory requested for JobId=71_*
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp02, allocated memory = 1 and all memory requested for JobId=71_*

The test-HPC is running on hardware, but we also created a test-HPC using a 3-VM set constructed by Vagrant running on a VirtualBox backend. I have included some of the 'slurmctld.log' file, the batch submission script, the slurm.conf file (of the hardware-based test-HPC), and the 'Vagrantfile' file (in case someone wants to recreate our test-HPC in a set of VMs.)

- Joe

- (some of) slurmctld.log -

[2023-06-15T20:11:32.631] select/cons_tres: node_data_dump: Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.631] select/cons_tres: node_data_dump: Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.631] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp01, allocated memory = 1 and all memory requested for JobId=71_7(71)
[2023-06-15T20:11:32.631] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp02, allocated memory = 1 and all memory requested for JobId=71_7(71)
[2023-06-15T20:11:32.631] select/cons_tres: _job_test: SELECT_TYPE: evaluating JobId=71_7(71) on 0 nodes
[2023-06-15T20:11:32.631] select/cons_tres: _job_test: SELECT_TYPE: test 0 fail: insufficient resources
[2023-06-15T20:11:32.631] select/cons_tres: common_job_test: no job_resources info for JobId=71_7(71) rc=-1
[2023-06-15T20:11:32.631] debug2: select/cons_tres: select_p_job_test: evaluating JobId=71_7(71)
[2023-06-15T20:11:32.631] select/cons_tres: common_job_test: JobId=71_7(71) node_mode:Normal alloc_mode:Test_Only
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list & exc_cores
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]
[2023-06-15T20:11:32.632] select/cons_tres: common_job_test: nodes: min:1 max:1 requested:1 avail:2
[2023-06-15T20:11:32.632] select/cons_tres: node_data_dump: Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.632] select/cons_tres: node_data_dump: Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.632] select/cons_tres: _job_test: SELECT_TYPE: evaluating JobId=71_7(71) on 2 nodes
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/enter
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15,node[1]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: can_job_run_on_node: SELECT_TYPE: 32 CPUs on hpc2-comp01(state:1), mem 1/1
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Node:hpc2-comp01 Sockets:2 SpecThreads:0 CPUs:Min-Max,Avail:2-32,32 ThreadsPerCore:2
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[0] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[1] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: can_job_run_on_node: SELECT_TYPE: 32 CPUs on hpc2-comp02(state:1), mem 1/1
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Node:hpc2-comp02 Sockets:2 SpecThreads:0 CPUs:Min-Max,Avail:2-32,32 ThreadsPerCore:2
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[0] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[1] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/elim_nodes
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15,node[1]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: _eval_nodes: set:0 consec CPUs:64 nodes:2:hpc2-comp[01-02] begin:
Re: [slurm-users] Need to free up memory for running more than one job on a node
Hello Joe,

You haven't defined any memory allocation or oversubscription in your slurm.conf, so by default Slurm is giving a full node's worth of memory to each job.

There are several ways to handle this, but what you probably want is to make both CPU and memory consumable resources with the parameter:

SelectTypeParameters=CR_CPU_Memory

Then you'll want to define the amount of memory (in megabytes) on a node as part of its node definition with RealMemory=. Lastly, you'll need to define a default amount of memory (in megabytes) per job, typically as memory per CPU, with DefMemPerCPU=.

With those changes, a job that doesn't ask for memory explicitly is given #CPUs x DefMemPerCPU. You can then use either the --mem or --mem-per-cpu flag to request more or less memory for a job.

There's also oversubscription, which lets jobs collectively request more memory than a node actually has; then you don't technically need to define memory per job, but you run into the issue that a single job can still use all of a node's memory and trigger OOM errors.
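As a rough illustration of those three settings (the node names below match your log, but the RealMemory and DefMemPerCPU values are placeholder assumptions; substitute what your nodes actually have), the relevant slurm.conf lines could look something like this:

    # Treat both CPUs and memory as consumable resources
    SelectType=select/cons_tres
    SelectTypeParameters=CR_CPU_Memory

    # Placeholder RealMemory (MB) -- use the node's actual usable RAM,
    # leaving some headroom for the OS and the Slurm daemons
    NodeName=hpc2-comp[01-02] Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=128000 State=UNKNOWN

    # Placeholder default: MB of memory granted per allocated CPU when a
    # job does not request memory explicitly
    DefMemPerCPU=4000

You can check what a node actually reports with 'slurmd -C' on that node, and after updating slurm.conf on all hosts run 'scontrol reconfigure' (or restart the daemons) for it to take effect. Jobs can then override the default at submit time, e.g. 'sbatch --mem=8G job.sh' or 'sbatch --mem-per-cpu=2G job.sh'.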
Regards,

--
Willy Markuske
HPC Systems Engineer
MS Data Science and Engineering
SDSC - Research Data Services
(619) 519-4435
wmarku...@sdsc.edu