I have reinstalled the Slurm resource manager on an HPC cluster, but there is a problem starting the slurmd service. Here is the system status.

"systemctl status slurmd" shows:

● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Sun 2024-11-10 11:31:24 +0130; 1 weeks 1 days ago

-------------------------------------------

The slurm.conf contents are:

# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
# slurm.conf file generated by configurator.html.
# See the slurm.conf man page for more information.
#
ClusterName=linux
ControlMachine=master.cluster....
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
#ProctrackType=proctrack/pgid
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
TaskPluginParam=cpusets
#PluginDir=
#FirstJobId=
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# GPU definition (Added by S 2022/11)
GresTypes=gpu
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SelectType=select/linear
SelectType=select/cons_res # select partial node
#SelectTypeParameters=CR_CPU_Memory
SelectTypeParameters=CR_Core_Memory
#FastSchedule=1
FastSchedule=0
PriorityType=priority/multifactor
PriorityDecayHalfLife=0
PriorityUsageResetPeriod=NONE
PriorityWeightFairshare=100000
#PriorityWeightAge=1000
PriorityWeightPartition=10000
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
DefMemPerCPU=2000
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=master
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
# OpenHPC default configuration
# Edited by Surin 1399/11
#AccountingStorageEnforce=limits
AccountingStorageEnforce=QOS,Limits,Associations
#TaskPlugin=task/affinity
#PropagateResourceLimitsExcept=MEMLOCK
#AccountingStorageType=accounting_storage/filetxt
Epilog=/etc/slurm/slurm.epilog.clean
NodeName=cn0[1-5] NodeHostName=cn0[1-5] RealMemory=128307 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Feature=HyperThread State=UNKNOWN
NodeName=gp01 NodeHostName=gp01 RealMemory=128307 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 Feature=HyperThread Gres=gpu:4 State=UNKNOWN
#PartitionName=all Nodes=cn0[1-5],gp01 MaxTime=INFINITE State=UP Oversubscribe=EXCLUSIVE
PartitionName=all Nodes=cn0[1-5],gp01 MaxTime=10-00:00:00 State=UP MaxNodes=1
PartitionName=normal Nodes=cn0[1-5] Default=YES MaxTime=10-00:00:00 State=UP MaxNodes=1
#PartitionName=normal Nodes=cn0[1-5] Default=YES MaxTime=INFINITE State=UP Oversubscribe=EXCLUSIVE
PartitionName=gpu Nodes=gp01 MaxTime=10-00:00:00 State=UP
SlurmctldParameters=enable_configless
ReturnToService=1
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300
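As far as I understand, slurmd works out its own NodeName by matching the machine's hostname against the NodeName/NodeHostName entries above, so those entries have to line up with what the node itself reports. A quick sanity check on an affected node (standard commands only; slurmd -C simply prints the node line it detects locally) would be:

    hostname -s                                 # short hostname the daemon will try to match
    slurmd -C                                   # NodeName/CPUs/Memory line slurmd detects on this machine
    grep -E '^NodeName' /etc/slurm/slurm.conf   # node definitions actually present in the config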
-------------------------------------------

"scontrol show node cn01" shows:

NodeName=cn01 CoresPerSocket=16
   CPUAlloc=0 CPUTot=64 CPULoad=N/A
   AvailableFeatures=HyperThread
   ActiveFeatures=HyperThread
   Gres=(null)
   NodeAddr=cn01 NodeHostName=cn01
   RealMemory=128557 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   State=DOWN* ThreadsPerCore=2 TmpDisk=64278 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=all,normal
   BootTime=None SlurmdStartTime=None
   CfgTRES=cpu=64,mem=128557M,billing=64
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=undraining

-------------------------------------------

"scontrol ping" also works:

Slurmctld(primary) at master.cluster... is UP

-------------------------------------------

"systemctl start slurmd" shows:

Job for slurmd.service failed because the control process exited with error code.
See "systemctl status slurmd.service" and "journalctl -xe" for details.

-------------------------------------------

"systemctl status slurmd.service" shows:

● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2024-11-18 17:02:50 +0130; 41s ago
  Process: 219025 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)

Nov 18 17:02:50 master.cluster.... systemd[1]: Starting Slurm node daemon...
Nov 18 17:02:50 master.cluster.... slurmd[219025]: fatal: Unable to determine this slurmd's NodeName
Nov 18 17:02:50 master.cluster.... systemd[1]: slurmd.service: control process exited, code=exited status=1
Nov 18 17:02:50 master.cluster.... systemd[1]: Failed to start Slurm node daemon.
Nov 18 17:02:50 master.cluster.... systemd[1]: Unit slurmd.service entered failed state.
Nov 18 17:02:50 master.cluster.... systemd[1]: slurmd.service failed.

-------------------------------------------

"journalctl -xe" output is:

Nov 18 17:04:54 master.cluster.... dhcpd[2250]: DHCPDISCOVER from 70:35:09:f8:13:40 via eno2: network eno2: no free leases
Nov 18 17:05:04 master.cluster.... dhcpd[2250]: DHCPDISCOVER from 70:35:09:f8:13:40 via eno2: network eno2: no free leases
Nov 18 17:05:08 master.cluster.... dhcpd[2250]: DHCPDISCOVER from 70:35:09:f8:13:40 via eno2: network eno2: no free leases
Nov 18 17:05:12 master.cluster.... dhcpd[2250]: DHCPDISCOVER from 70:35:09:f8:13:40 via eno2: network eno2: no free leases
Nov 18 17:05:22 master.cluster.... dhcpd[2250]: DHCPDISCOVER from 70:35:09:f8:13:40 via eno2: network eno2: no free leases
Nov 18 17:05:26 master.cluster.... dhcpd[2250]: DHCPDISCOVER from 70:35:09:f8:13:40 via eno2: network eno2: no free leases
Nov 18 17:06:34 master.cluster.... munged[2514]: Purged 2 credentials from replay hash
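The journal above only shows unrelated dhcpd and munge messages, nothing from slurmd itself, so for more detail I can run the daemon in the foreground with verbose logging (standard slurmd options: -D stays in the foreground, -vvv raises verbosity):

    /usr/sbin/slurmd -D -vvv

and, since the fatal error is about resolving the NodeName, the name can also be forced for a test run (cn01 here is only an example taken from the config):

    /usr/sbin/slurmd -D -vvv -N cn01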
-------------------------------------------

The slurmd.service unit file contains:

[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes

[Install]
WantedBy=multi-user.target

-------------------------------------------

"which slurmd" returns:

/usr/sbin/slurmd

-------------------------------------------

"ls -l /usr/sbin/slurmd" shows:

lrwxrwxrwx 1 root root 69 Nov 10 10:54 /usr/sbin/slurmd -> /install/centos7.9/compute_gpu/rootimg/usr/sbin/slurmd
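Since /usr/sbin/slurmd is a symlink into the compute-node image directory rather than a binary installed directly on the master, I am not sure whether that matters, but the checks below (ordinary tools, just a sketch) should confirm whether the target resolves and executes on this host:

    readlink -f /usr/sbin/slurmd                # final target of the symlink chain
    file "$(readlink -f /usr/sbin/slurmd)"      # is the target a real ELF binary on this machine?
    /usr/sbin/slurmd -V                         # does it run at all and report its version?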