Hello,

(This is my first time submitting a question to the list)

We have a test-HPC with 1 login node and 2 compute nodes. When we submit 90 jobs onto the test-HPC, only one job runs per node. We seem to be allocating all of a node's memory to that one job, and the other jobs cannot run until the memory is freed up.

Any ideas on what we need to change in order to free up the memory?
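
(For reference, the remaining array tasks all sit in PENDING with Reason=Resources while a single task runs on each node. If anyone wants to see the same view on their own setup, the pending-reason column can be shown with something like:

squeue -o "%.10i %.8T %.12r %N"

where %i, %T, %r and %N are the job id, state, reason, and node list format fields.)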

~ ~

We noticed this from the 'slurmctld.log' ...

[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp01, allocated memory = 1 and all memory requested for JobId=71_*
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp02, allocated memory = 1 and all memory requested for JobId=71_*
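
(That matches the AvailMem:1 AllocMem:1 values in the node_data_dump lines further down, so it looks like the scheduler is treating each node as having only 1 MB of memory, all of it allocated. For anyone comparing against their own nodes, the same figures can be pulled out of the standard node record with something like:

scontrol show node hpc2-comp01 | grep -E 'RealMemory|AllocMem|FreeMem'

where RealMemory, AllocMem and FreeMem are the usual fields in scontrol's node output.)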

The test-HPC runs on real hardware, but we also created an equivalent test-HPC as a set of 3 VMs built with Vagrant on a VirtualBox backend.

I have included part of the 'slurmctld.log' file, the batch submission script, the slurm.conf file (from the hardware-based test-HPC), and the 'Vagrantfile' (in case someone wants to recreate our test-HPC as a set of VMs).

- Joe


----- (some of) slurmctld.log -----------------------------------------

[2023-06-15T20:11:32.631] select/cons_tres: node_data_dump: Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.631] select/cons_tres: node_data_dump: Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.631] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp01, allocated memory = 1 and all memory requested for JobId=71_7(71)
[2023-06-15T20:11:32.631] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp02, allocated memory = 1 and all memory requested for JobId=71_7(71)
[2023-06-15T20:11:32.631] select/cons_tres: _job_test: SELECT_TYPE: evaluating JobId=71_7(71) on 0 nodes
[2023-06-15T20:11:32.631] select/cons_tres: _job_test: SELECT_TYPE: test 0 fail: insufficient resources
[2023-06-15T20:11:32.631] select/cons_tres: common_job_test: no job_resources info for JobId=71_7(71) rc=-1
[2023-06-15T20:11:32.631] debug2: select/cons_tres: select_p_job_test: evaluating JobId=71_7(71)
[2023-06-15T20:11:32.631] select/cons_tres: common_job_test: JobId=71_7(71) node_mode:Normal alloc_mode:Test_Only
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list & exc_cores
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]
[2023-06-15T20:11:32.632] select/cons_tres: common_job_test: nodes: min:1 max:1 requested:1 avail:2
[2023-06-15T20:11:32.632] select/cons_tres: node_data_dump: Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.632] select/cons_tres: node_data_dump: Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.632] select/cons_tres: _job_test: SELECT_TYPE: evaluating JobId=71_7(71) on 2 nodes
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/enter
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15,node[1]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: can_job_run_on_node: SELECT_TYPE: 32 CPUs on hpc2-comp01(state:1), mem 1/1
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Node:hpc2-comp01 Sockets:2 SpecThreads:0 CPUs:Min-Max,Avail:2-32,32 ThreadsPerCore:2
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[0] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[1] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: can_job_run_on_node: SELECT_TYPE: 32 CPUs on hpc2-comp02(state:1), mem 1/1
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Node:hpc2-comp02 Sockets:2 SpecThreads:0 CPUs:Min-Max,Avail:2-32,32 ThreadsPerCore:2
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[0] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[1] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/elim_nodes
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15,node[1]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: _eval_nodes: set:0 consec CPUs:64 nodes:2:hpc2-comp[01-02] begin:0 end:1 required:-1 weight:511
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/choose_nodes
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp01
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15,node[1]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/sync_cores
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp01
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: _job_test: SELECT_TYPE: test 0 pass: test_only
[2023-06-15T20:11:32.632] select/cons_tres: common_job_test: no job_resources info for JobId=71_7(71) rc=0
[2023-06-15T20:11:32.632] debug3: sched: JobId=71_*. State=PENDING. Reason=Resources. Priority=4294901759. Partition=debug.
[2023-06-15T20:11:56.645] debug:  Spawning ping agent for hpc2-comp[01-02]
[2023-06-15T20:11:56.645] debug2: Spawning RPC agent for msg_type REQUEST_PING
[2023-06-15T20:11:56.646] debug3: Tree sending to hpc2-comp01
[2023-06-15T20:11:56.646] debug2: Tree head got back 0 looking for 2
[2023-06-15T20:11:56.646] debug3: Tree sending to hpc2-comp02
[2023-06-15T20:11:56.647] debug2: Tree head got back 1
[2023-06-15T20:11:56.647] debug2: Tree head got back 2
[2023-06-15T20:11:56.651] debug2: node_did_resp hpc2-comp01
[2023-06-15T20:11:56.651] debug2: node_did_resp hpc2-comp02
[2023-06-15T20:11:57.329] debug: sched/backfill: _attempt_backfill: beginning
[2023-06-15T20:11:57.329] debug: sched/backfill: _attempt_backfill: 1 jobs to backfill
[2023-06-15T20:11:57.329] debug2: sched/backfill: _attempt_backfill: entering _try_sched for JobId=71_*.
[2023-06-15T20:11:57.329] debug2: select/cons_tres: select_p_job_test: evaluating JobId=71_*
[2023-06-15T20:11:57.329] select/cons_tres: common_job_test: JobId=71_* node_mode:Normal alloc_mode:Will_Run
[2023-06-15T20:11:57.329] select/cons_tres: core_array_log: node_list & exc_cores
[2023-06-15T20:11:57.329] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]
[2023-06-15T20:11:57.329] select/cons_tres: common_job_test: nodes: min:1 max:1 requested:1 avail:2
[2023-06-15T20:11:57.330] select/cons_tres: node_data_dump: Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:57.330] select/cons_tres: node_data_dump: Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp01, allocated memory = 1 and all memory requested for JobId=71_*
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp02, allocated memory = 1 and all memory requested for JobId=71_*
[2023-06-15T20:11:57.330] select/cons_tres: _job_test: SELECT_TYPE: evaluating JobId=71_* on 0 nodes
[2023-06-15T20:11:57.330] select/cons_tres: _job_test: SELECT_TYPE: test 0 fail: insufficient resources
[2023-06-15T20:11:57.330] select/cons_tres: _will_run_test: JobId=71_5(76): overlap=1
[2023-06-15T20:11:57.330] select/cons_tres: job_res_rm_job: JobId=71_5(76) action:normal
[2023-06-15T20:11:57.330] ====================
[2023-06-15T20:11:57.330] JobId=71_5(76) nhosts:1 ncpus:1 node_req:1 nodes=hpc2-comp01
[2023-06-15T20:11:57.330] Node[0]:
[2023-06-15T20:11:57.330]   Mem(MB):1:0  Sockets:2  Cores:8  CPUs:2:0
[2023-06-15T20:11:57.330]   Socket[0] Core[0] is allocated
[2023-06-15T20:11:57.330] --------------------
[2023-06-15T20:11:57.330] cpu_array_value[0]:2 reps:1
[2023-06-15T20:11:57.330] ====================
[2023-06-15T20:11:57.330] debug3: select/cons_tres: job_res_rm_job: removed JobId=71_5(76) from part debug row 0
[2023-06-15T20:11:57.330] select/cons_tres: job_res_rm_job: JobId=71_5(76) finished
[2023-06-15T20:11:57.330] select/cons_tres: _will_run_test: JobId=71_6(77): overlap=1
[2023-06-15T20:11:57.330] select/cons_tres: job_res_rm_job: JobId=71_6(77) action:normal
[2023-06-15T20:11:57.330] ====================
[2023-06-15T20:11:57.330] JobId=71_6(77) nhosts:1 ncpus:1 node_req:1 nodes=hpc2-comp02
[2023-06-15T20:11:57.330] Node[0]:
[2023-06-15T20:11:57.330]   Mem(MB):1:0  Sockets:2  Cores:8  CPUs:2:0
[2023-06-15T20:11:57.330]   Socket[0] Core[0] is allocated
[2023-06-15T20:11:57.330] --------------------
[2023-06-15T20:11:57.330] cpu_array_value[0]:2 reps:1
[2023-06-15T20:11:57.330] ====================

----- batch script -----------------------------------

#!/bin/bash

echo "Running on: ${SLURM_CLUSTER_NAME}, node list: ${SLURM_JOB_NODELIST}, node names: ${SLURMD_NODENAME} in: `pwd` at `date`" echo "SLURM_NTASKS: ${SLURM_NTASKS} SLURM_TASKS_PER_NODE: ${SLURM_TASKS_PER_NODE} "
echo "SLURM_ARRAY_TASK_ID: ${SLURM_ARRAY_TASK_ID}"
echo "SLURM_MEM_PER_CPU: ${SLURM_MEM_PER_CPU}"

sleep 3600

echo "END"

Here is the sbatch command to run it:

sbatch -J test -a1-10 -t 999:00:00 -N 1 -n 1 -p debug sbatch.slurm
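
(The submission above does not pass any --mem or --mem-per-cpu option, so the jobs rely on whatever default memory the node/partition provides. Purely as an illustration, the same submission with an explicit per-job memory request would look something like this; the 100M figure is just an example value, not something we have tested:

sbatch -J test -a 1-10 -t 999:00:00 -N 1 -n 1 --mem=100M -p debug sbatch.slurm

)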

----- slurm.conf -----------------------------------

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster
SlurmctldHost=hpc2-comp00
#SlurmctldHost=
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/linuxproc
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
SrunPortRange=60001-60005
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
#TaskPlugin=task/affinity,task/cgroup
TaskPlugin=task/none
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
AccountingStorageHost=hpc2-comp00
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStoreFlags=job_comment,job_env,job_extra,job_script
#JobCompHost=localhost
#JobCompLoc=slurm_jobcomp_db
##JobCompParams=
#JobCompPass=/var/run/munge/munge.socket.2
#JobCompPort=3306
#JobCompType=jobcomp/mysql
JobCompType=jobcomp/none
#JobCompUser=slurm
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
# Enabled next line - 06-15-2023
SlurmctldDebug=debug5
SlurmctldLogFile=/var/log/slurmctld.log
# Enabled next line - 06-15-2023
SlurmdDebug=debug5
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=

# Added next line : 06-15-2023
DebugFlags=Cgroup,CPU_Bind,Data,Gres,NodeFeatures,SelectType,Steps,TraceJobs

#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=hpc2-comp[01-02] CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=hpc2-comp[01-02] Default=YES MaxTime=INFINITE State=UP
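
(For completeness: we have not set any RealMemory value on the NodeName line above. If anyone wants to compare against what the compute nodes actually report, slurmd can dump the hardware it detects when run on a node, e.g.:

slurmd -C

which prints a NodeName=... line for that node, including the detected RealMemory.)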

----- Vagrantfile file -----------------------------------

# -*- mode: ruby -*-
# vi: set ft=ruby :

# All Vagrant configuration is done below. The "2" in Vagrant.configure
# configures the configuration version (we support older styles for
# backwards compatibility). Please don't change it unless you know what
# you're doing.
Vagrant.configure("2") do |config|
  # The most common configuration options are documented and commented below.
  # For a complete reference, please see the online documentation at
  # https://docs.vagrantup.com.

  # Every Vagrant development environment requires a box. You can search for
  # boxes at https://vagrantcloud.com/search.
  config.vm.box = "generic/fedora37"

  # Stop vagrant from generating a new key for each host to allow ssh between
  # machines.
  config.ssh.insert_key = false

  # The Vagrant commands are too limited to configure a NAT network,
  # so run the VBoxManager commands by hand.
  config.vm.provider "virtualbox" do |vbox|
    # Add nic2 (eth1 on the guest VM) as the physical router.  Never change
    # nic1, because that's what the host uses to communicate with the guest VM.
    vbox.customize ["modifyvm", :id,
                    "--nic2", "bridged",
                    "--bridge-adapter2", "enp8s0"]
  end

  # Common provisioning for all guest VMs.
  config.vm.provision "shell", inline: <<-SHELL
    # Show which command is being run to associate with command output!
    set -x

    # Remove spurious hosts from the VM image.
    sed -i '/fedora37/d' /etc/hosts
    sed -i '/^127[.]0[.]1[.]1/d' /etc/hosts

    # Add NAT network to /etc/hosts.
    for host in 10.0.1.{100..102}
    do
        hostname=hpc2-comp${host:8}
        grep -q $host /etc/hosts ||
             echo "$host   $hostname" >> /etc/hosts
    done
    unset host hostname

    # Use latest set of packages.
    dnf -y update

    # Install MUNGE.
    dnf -y install munge

    # Create the SLURM user.
    id -u slurm ||
         useradd -r -s /sbin/nologin -d /etc/slurm -c "SLURM job scheduler" slurm
  SHELL

  config.vm.define "hpc2_comp00" do |hpc2_comp00|
    hpc2_comp00.vm.hostname = "hpc2-comp00"
    hpc2_comp00.vm.synced_folder ".", "/vagrant", automount: true
    hpc2_comp00.vm.provision :shell, inline: <<-SHELL
      # Show which command is being run to associate with command output!
      set -x

      # Set static IP address for NAT network.
      HOST=10.0.1.100
      ETH1=/etc/NetworkManager/system-connections/eth1.nmconnection
      sed "s|address1=10.0.1.100|address1=${HOST}|" \
          /vagrant/eth1.nmconnection > $ETH1
      chmod go-r $ETH1
      nmcli con load $ETH1
      unset HOST

      # Create the MUNGE key.
      [[ -f /etc/munge/munge.key ]] || sudo -u munge /usr/sbin/mungekey -v
      cp -av /etc/munge/munge.key /vagrant/

      # Enable and start munge.
      systemctl enable munge
      systemctl start munge
      systemctl status munge

      # Setup database on the head node:
      dnf -y install mariadb-devel mariadb-server

      # Set recommended memory (5%-50% RAM) and timeout.
      CNF=/etc/my.cnf.d/mariadb-server.cnf

      # Note we need to use a double backslash for the newline character
      # below, because of how Vagrant escapes the inline shell script.
      MYSQL_RAM=$(awk '/^MemTotal/ {printf "%.0f\\n", $2*0.05}' /proc/meminfo)
      grep -q innodb_buffer_pool_size $CNF ||
           sed -i '/InnoDB/a innodb_buffer_pool_size='${MYSQL_RAM}K $CNF
      grep -q innodb_lock_wait_timeout $CNF ||
           sed -i '/innodb_buffer_pool_size/a innodb_lock_wait_timeout=900' $CNF
      unset CNF MYSQL_RAM

      # Run the head node services:
      systemctl enable mariadb
      systemctl start mariadb
      systemctl status mariadb

      # Secure the server.
      #
      # Send interactive commands using printf per
      # https://unix.stackexchange.com/a/112348
      printf "%s\n" "" n n y y y y | mariadb-secure-installation

      # Install the RPM package builder for SLURM.
      dnf -y install rpmdevtools

      # Download SLURM.
      wget -nc https://download.schedmd.com/slurm/slurm-23.02.0.tar.bz2
      # Install the source package dependencies to determine the dependencies
      # to build the binary.
      dnf -y install \
          dbus-devel \
          freeipmi-devel \
          hdf5-devel \
          http-parser-devel \
          json-c-devel \
          libcurl-devel \
          libjwt-devel \
          libyaml-devel \
          lua-devel \
          lz4-devel \
          man2html \
          readline-devel \
          rrdtool-devel
      SLURM_SRPM=~/rpmbuild/SRPMS/slurm-23.02.0-1.fc37.src.rpm
      # Create SLURM .rpmmacros file.
      cp -av /vagrant/.rpmmacros .
      [[ -f ${SLURM_SRPM} ]] ||
           rpmbuild -ts slurm-23.02.0.tar.bz2 |& tee build-slurm-source.log
      # Installs the source package dependencies to build the binary.
      dnf -y builddep ${SLURM_SRPM}
      unset SLURM_SRPM
      # Build the SLURM binaries.
      SLURM_RPM=~/rpmbuild/RPMS/x86_64/slurm-23.02.0-1.fc37.x86_64.rpm
      [[ -f ${SLURM_RPM} ]] ||
           rpmbuild -ta slurm-23.02.0.tar.bz2 |& tee build-slurm-binary.log
      unset SLURM_RPM

      # Copy SLURM packages to the compute nodes.
      DIR_RPM=~/rpmbuild/RPMS/x86_64
      cp -av ${DIR_RPM}/slurm-23*.rpm ${DIR_RPM}/slurm-slurmd-23*.rpm /vagrant/ &&
         touch /vagrant/sentinel-copied-rpms.done

      # Install all SLURM packages on the head node.
      find ${DIR_RPM} -type f -not -name '*-slurmd-*' -not -name '*-torque-*' \
           -exec dnf -y install {} +
      unset DIR_RPM
      # Copy the configuration files.
      cp -av /vagrant/slurmdbd.conf /etc/slurm/slurmdbd.conf
      chown slurm:slurm /etc/slurm/slurmdbd.conf
      chmod 600 /etc/slurm/slurmdbd.conf
      cp -av /vagrant/slurm.conf /etc/slurm/slurm.conf
      chown root:root /etc/slurm/slurm.conf
      chmod 644 /etc/slurm/slurm.conf

      # Now create the slurm MySQL user.
      SLURM_PASSWORD=$(awk -vFS='=' '/StoragePass/ {print $2}' /etc/slurm/slurmdbd.conf)
      DBD_HOST=localhost
      # https://docs.fedoraproject.org/en-US/quick-docs/installing-mysql-mariadb/#_start_mysql_service_and_enable_at_loggin
      mysql -e "create user if not exists 'slurm'@'$DBD_HOST' identified by '$SLURM_PASSWORD';"
      mysql -e "grant all on slurm_acct_db.* to 'slurm'@'$DBD_HOST';"
      mysql -e "show grants for 'slurm'@'$DBD_HOST';"
      DBD_HOST=hpc2-comp00
mysql -e "create user if not exists 'slurm'@'$DBD_HOST' identified by '$SLURM_PASSWORD';"
      mysql -e "grant all on slurm_acct_db.* to 'slurm'@'$DBD_HOST';"
      mysql -e "show grants for 'slurm'@'$DBD_HOST';"
      unset SLURM_PASSWORD DBD_HOST

      systemctl enable slurmdbd
      systemctl start slurmdbd
      systemctl status slurmdbd

      systemctl enable slurmctld
      mkdir -p /var/spool/slurmctld
      chown slurm:slurm /var/spool/slurmctld
      # Open ports for slurmctld (6817) and slurmdbd (6819).
      firewall-cmd --add-port=6817/tcp
      firewall-cmd --add-port=6819/tcp
      firewall-cmd --runtime-to-permanent
      systemctl start slurmctld
      systemctl status slurmctld

      # Clear any previous node DOWN errors.
      sinfo -s
      sinfo -R
      scontrol update nodename=ALL state=RESUME
      sinfo -s
      sinfo -R
  SHELL
  end

  config.vm.define "hpc2_comp01" do |hpc2_comp01|
    hpc2_comp01.vm.hostname = "hpc2-comp01"
    hpc2_comp01.vm.synced_folder ".", "/vagrant", automount: true
    hpc2_comp01.vm.provision :shell, inline: <<-SHELL
      # Show which command is being run to associate with command output!
      set -x

      # Set static IP address for NAT network.
      HOST=10.0.1.101
      ETH1=/etc/NetworkManager/system-connections/eth1.nmconnection
      sed "s|address1=10.0.1.100|address1=${HOST}|" \
          /vagrant/eth1.nmconnection > $ETH1
      chmod go-r $ETH1
      nmcli con load $ETH1
      unset HOST

      # Copy the MUNGE key.
      KEY=/etc/munge/munge.key
      cp -av /vagrant/munge.key /etc/munge/
      chown munge:munge $KEY
      chmod 600 $KEY

      # Enable and start munge.
      systemctl enable munge
      systemctl start munge
      systemctl status munge

      # SLURM packages to be installed on compute nodes.
      DIR_RPM=~/rpmbuild/RPMS/x86_64
      mkdir -p ${DIR_RPM}
      rsync -avP /vagrant/slurm*.rpm ${DIR_RPM}/
      dnf -y install ${DIR_RPM}/slurm-23*.rpm
      dnf -y install ${DIR_RPM}/slurm-slurmd-23*.rpm
      unset DIR_RPM
      # Copy the configuration file.
      mkdir -p /etc/slurm
      cp -av /vagrant/slurm.conf /etc/slurm/slurm.conf
      chown root:root /etc/slurm/slurm.conf
      chmod 644 /etc/slurm/slurm.conf
      # Only enable slurmd on the worker nodes.
      systemctl enable slurmd
      # Open port for slurmd (6818).
      firewall-cmd --add-port=6818/tcp
      firewall-cmd --runtime-to-permanent
      # Open port range for srun.
      SRUN_PORT_RANGE=$(awk -vFS='=' '/SrunPortRange/ {print $2}' /etc/slurm/slurm.conf)
      firewall-cmd --add-port=$SRUN_PORT_RANGE/tcp
      firewall-cmd --runtime-to-permanent
      systemctl start slurmd
      systemctl status slurmd

      # Clear any previous node DOWN errors.
      sinfo -s
      sinfo -R
      scontrol update nodename=ALL state=RESUME
      sinfo -s
      sinfo -R
  SHELL
  end

  config.vm.define "hpc2_comp02" do |hpc2_comp02|
    hpc2_comp02.vm.hostname = "hpc2-comp02"
    hpc2_comp02.vm.synced_folder ".", "/vagrant", automount: true
    hpc2_comp02.vm.provision :shell, inline: <<-SHELL
      # Show which command is being run to associate with command output!
      set -x

      # Set static IP address for NAT network.
      HOST=10.0.1.102
      ETH1=/etc/NetworkManager/system-connections/eth1.nmconnection
      sed "s|address1=10.0.1.100|address1=${HOST}|" \
          /vagrant/eth1.nmconnection > $ETH1
      chmod go-r $ETH1
      nmcli con load $ETH1
      unset HOST

      # Copy the MUNGE key.
      KEY=/etc/munge/munge.key
      cp -av /vagrant/munge.key /etc/munge/
      chown munge:munge $KEY
      chmod 600 $KEY

      # Enable and start munge.
      systemctl enable munge
      systemctl start munge
      systemctl status munge

      # SLURM packages to be installed on compute nodes.
      DIR_RPM=~/rpmbuild/RPMS/x86_64
      mkdir -p ${DIR_RPM}
      rsync -avP /vagrant/slurm*.rpm ${DIR_RPM}/
      dnf -y install ${DIR_RPM}/slurm-23*.rpm
      dnf -y install ${DIR_RPM}/slurm-slurmd-23*.rpm
      unset DIR_RPM
      # Copy the configuration file.
      mkdir -p /etc/slurm
      cp -av /vagrant/slurm.conf /etc/slurm/slurm.conf
      chown root:root /etc/slurm/slurm.conf
      chmod 644 /etc/slurm/slurm.conf
      # Only enable slurmd on the worker nodes.
      systemctl enable slurmd
      # Open port for slurmd (6818).
      firewall-cmd --add-port=6818/tcp
      firewall-cmd --runtime-to-permanent
      systemctl start slurmd
      systemctl status slurmd

      # Clear any previous node DOWN errors.
      sinfo -s
      sinfo -R
      scontrol update nodename=ALL state=RESUME
      sinfo -s
      sinfo -R
  SHELL
  end
end
