sacctmgr list cluster

   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- --------------------
       tuc                         6817  8704         1


From: slurm-users On Behalf Of Sean
Sent: Tuesday, April 6, 2021 2:11 PM
To: Slurm User Community List
Subject: Re: [slurm-users] [EXT] slurmctld error


I just checked my cluster and my spool dir is




(i.e. without the d at the end)


It doesn't really matter, as long as the directory exists and has the correct 
permissions on all nodes

Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia



On Tue, 6 Apr 2021 at 20:52, Sean Crosby wrote:

I think I've worked out a problem


I see in your slurm.conf you have this




It should be




You'll need to restart slurmd on all the nodes after you make that change


I would also double check the permissions on that directory on all your nodes. 
It needs to be owned by user slurm


ls -lad /var/spool/slurmd







On Tue, 6 Apr 2021 at 20:37, Sean Crosby wrote:

It looks like your ctld isn't contacting slurmdbd properly. The control 
host, control port, etc. are all blank.


The first thing I would do is change the ClusterName in your slurm.conf from 
upper case TUC to lower case tuc. You'll then need to restart your ctld. Then 
recheck sacctmgr show cluster


If that doesn't work, try changing AccountingStorageHost in slurm.conf to 
localhost as well
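
A minimal sketch of checking the ClusterName suggestion above before restarting 
slurmctld. The helper name cluster_name_case and the temporary sample file are 
illustrative assumptions, not part of Slurm; on a real system you would pass the 
path of your own slurm.conf. The match on the key is case sensitive, which is a 
simplification.

```shell
# Hypothetical helper: report whether ClusterName in a slurm.conf-style
# file contains upper-case characters.
cluster_name_case() {
    conf="$1"
    # Take the last uncommented ClusterName= line in the file.
    name=$(sed -n 's/^[[:space:]]*ClusterName=\([^[:space:]#]*\).*/\1/p' "$conf" | tail -n 1)
    if [ -z "$name" ]; then
        echo "ClusterName not set"
        return 2
    fi
    lower=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')
    if [ "$name" = "$lower" ]; then
        echo "ok: $name"
    else
        echo "mixed or upper case: $name (consider $lower)"
        return 1
    fi
}

# Example with a throwaway file standing in for slurm.conf.
sample=$(mktemp)
printf 'ClusterName=TUC\n' > "$sample"
cluster_name_case "$sample" || true
rm -f "$sample"
```

After editing the real file, the thread's own check (sacctmgr show cluster) is 
still the way to confirm that the controller has registered.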


For your worker nodes, your nodes are all in drain state.


Show the output of


scontrol show node wn001


It will give you the reason why the node is drained.
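
As a small illustration of the step above, the sketch below pulls only the 
Reason field out of scontrol show node output. The helper name node_reason and 
the sample lines in the offline demonstration are made up for illustration; 
Reason= is the key scontrol prints for a drained or down node.

```shell
# Hypothetical helper: print the Reason= field from `scontrol show node`
# output supplied on standard input.
node_reason() {
    sed -n 's/^[[:space:]]*Reason=//p'
}

# On a cluster: scontrol show node wn001 | node_reason
# Offline demonstration with invented sample lines:
printf '   State=IDLE+DRAIN\n   Reason=example reason text [slurm@2021-04-06T12:00:00]\n' | node_reason
```

sinfo -R, mentioned elsewhere in this thread, lists the reason for every 
drained or down node at once.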







On Tue, 6 Apr 2021 at 20:19, < <> > 

UoM notice: External email. Be cautious of links, attachments, or impersonation 



sinfo -N -o "%N %T %C %m %P %a"


wn001 drained 0/0/2/2 3934 TUC* up
wn002 drained 0/0/2/2 3934 TUC* up
wn003 drained 0/0/2/2 3934 TUC* up
wn004 drained 0/0/2/2 3934 TUC* up
wn005 drained 0/0/2/2 3934 TUC* up
wn006 drained 0/0/2/2 3934 TUC* up
wn007 drained 0/0/2/2 3934 TUC* up
wn008 drained 0/0/2/2 3934 TUC* up
wn009 drained 0/0/2/2 3934 TUC* up
wn010 drained 0/0/2/2 3934 TUC* up
wn011 drained 0/0/2/2 3934 TUC* up
wn012 drained 0/0/2/2 3934 TUC* up
wn013 drained 0/0/2/2 3934 TUC* up
wn014 drained 0/0/2/2 3934 TUC* up
wn015 drained 0/0/2/2 3934 TUC* up
wn016 drained 0/0/2/2 3934 TUC* up
wn017 drained 0/0/2/2 3934 TUC* up
wn018 drained 0/0/2/2 3934 TUC* up
wn019 drained 0/0/2/2 3934 TUC* up
wn020 drained 0/0/2/2 3934 TUC* up
wn021 drained 0/0/2/2 3934 TUC* up
wn022 drained 0/0/2/2 3934 TUC* up
wn023 drained 0/0/2/2 3934 TUC* up
wn024 drained 0/0/2/2 3934 TUC* up
wn025 drained 0/0/2/2 3934 TUC* up
wn026 drained 0/0/2/2 3934 TUC* up
wn027 drained 0/0/2/2 3934 TUC* up
wn028 drained 0/0/2/2 3934 TUC* up
wn029 drained 0/0/2/2 3934 TUC* up
wn030 drained 0/0/2/2 3934 TUC* up
wn031 drained 0/0/2/2 3934 TUC* up
wn032 drained 0/0/2/2 3934 TUC* up
wn033 drained 0/0/2/2 3934 TUC* up
wn034 drained 0/0/2/2 3934 TUC* up
wn035 drained 0/0/2/2 3934 TUC* up
wn036 drained 0/0/2/2 3934 TUC* up
wn037 drained 0/0/2/2 3934 TUC* up
wn038 drained 0/0/2/2 3934 TUC* up
wn039 drained 0/0/2/2 3934 TUC* up
wn040 drained 0/0/2/2 3934 TUC* up
wn041 drained 0/0/2/2 3934 TUC* up
wn042 drained 0/0/2/2 3934 TUC* up
wn043 drained 0/0/2/2 3934 TUC* up
wn044 drained 0/0/2/2 3934 TUC* up


From: slurm-users On Behalf Of Sean Crosby
Sent: Tuesday, April 6, 2021 12:47 PM
To: Slurm User Community List
Subject: Re: [slurm-users] [EXT] slurmctld error


It looks like your attachment of sinfo -R didn't come through


It also looks like your dbd isn't set up correctly


Can you also show the output of


sacctmgr list cluster




scontrol show config | grep ClusterName







On Tue, 6 Apr 2021 at 19:18, Ioannis Botsis wrote:




Hi Sean,


I am trying to submit a simple job, but it hangs:


srun -n44 -l /bin/hostname
srun: Required node not available (down, drained or reserved)
srun: job 15 queued and waiting for resources
^Csrun: Job allocation 15 has been revoked
srun: Force Terminated job 15



The daemons are active and running on the server and all nodes.


The node definition in slurm.conf is:



NodeName=wn0[01-44] CPUs=2 RealMemory=3934 Sockets=2 CoresPerSocket=2 

PartitionName=TUC Nodes=ALL Default=YES MaxTime=INFINITE State=UP


tail -10 /var/log/slurmdbd.log


[2021-04-06T12:09:16.481] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.481] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port
[2021-04-06T12:09:16.482] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.482] error: It looks like the storage has gone away trying to reconnect
[2021-04-06T12:09:16.483] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.483] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port
[2021-04-06T12:09:16.484] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.484] error: It looks like the storage has gone away trying to reconnect
[2021-04-06T12:09:16.484] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.485] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port


tail -10 /var/log/slurmctld.log


[2021-04-06T12:09:35.701] debug:  backfill: no jobs to backfill
[2021-04-06T12:09:42.001] debug:  slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
[2021-04-06T12:10:00.042] debug:  slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
[2021-04-06T12:10:05.701] debug:  backfill: beginning
[2021-04-06T12:10:05.701] debug:  backfill: no jobs to backfill
[2021-04-06T12:10:05.989] debug:  sched: Running job scheduler
[2021-04-06T12:10:19.001] debug:  slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
[2021-04-06T12:10:35.702] debug:  backfill: beginning
[2021-04-06T12:10:35.702] debug:  backfill: no jobs to backfill
[2021-04-06T12:10:37.001] debug:  slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)


Attached sinfo -R  


Any hint?




From: slurm-users On Behalf Of Sean Crosby
Sent: Tuesday, April 6, 2021 7:54 AM
To: Slurm User Community List
Subject: Re: [slurm-users] [EXT] slurmctld error


The other thing I notice for my slurmdbd.conf is that I have




You can try changing your slurmdbd.conf to set those 2 values as well to see if 
that gets slurmdbd to listen on port 6819







On Tue, 6 Apr 2021 at 14:31, Sean Crosby wrote:

Interesting. It looks like slurmdbd is not opening the 6819 port


What does


ss -lntp | grep 6819


show? Is something else using that port?


You can also stop the slurmdbd service and run it in debug mode using


slurmdbd -D -vvv







On Tue, 6 Apr 2021 at 14:02, < <>> 




Hi Sean


ss -lntp | grep $(pidof slurmdbd) returns nothing.


systemctl status slurmdbd.service


● slurmdbd.service - Slurm DBD accounting daemon
     Loaded: loaded (/lib/systemd/system/slurmdbd.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2021-04-05 13:52:35 EEST; 16h ago
       Docs: man:slurmdbd(8)
    Process: 1453365 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 1453375 (slurmdbd)
      Tasks: 1
     Memory: 5.0M
     CGroup: /system.slice/slurmdbd.service
             └─1453375 /usr/sbin/slurmdbd

Apr 05 13:52:35 systemd[1]: Starting Slurm DBD accounting daemon...
Apr 05 13:52:35 systemd[1]: slurmdbd.service: Can't open PID file /run/ (yet?) after start: Operation not permitted
Apr 05 13:52:35 systemd[1]: Started Slurm DBD accounting daemon.


The file /run/ exists and contains the PID reported by pidof slurmdbd.


From: slurm-users On Behalf Of Sean Crosby
Sent: Tuesday, April 6, 2021 12:49 AM
To: Slurm User Community List
Subject: Re: [slurm-users] [EXT] slurmctld error


What's the output of


ss -lntp | grep $(pidof slurmdbd)


on your dbd host?







On Tue, 6 Apr 2021 at 05:00, < <>> 




Hi Sean, it is both the dbd and ctld host, with name se01. The firewall is inactive.


nc -nz 6819 || echo Connection not working


gives me back: Connection not working





From: slurm-users On Behalf Of Sean Crosby
Sent: Monday, April 5, 2021 2:52 PM
To: Slurm User Community List
Subject: Re: [slurm-users] [EXT] slurmctld error


The error shows

slurmctld: debug2: Error connecting slurm stream socket at  
<> Connection refused

slurmctld: error: slurm_persist_conn_open_without_init: failed to open 
persistent connection to se01:6819: Connection refused


Is the IP address of the host running slurmdbd?

If so, check the iptables firewall running on that host, and make sure the ctld 
server can access port 6819 on the dbd host.

You can check this by running the following from the ctld host (requires the 
package nmap-ncat installed)

nc -nz 6819 || echo Connection not working

This will try connecting to port 6819 on the host, and output 
nothing if the connection works; otherwise it will output Connection not working.

I would also test this on the DBD server itself
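
If nc is not installed, a rough equivalent of the check above can be sketched 
with bash's /dev/tcp redirection. This is only a sketch: the function name 
port_open is hypothetical, it assumes bash and the coreutils timeout command are 
available, and the host argument is whatever name or address your dbd host has 
(se01 is the host name that appears in the error above).

```shell
# Hypothetical helper: succeed if a TCP connection to host:port can be opened.
# Assumes bash (for /dev/tcp) and coreutils `timeout` are installed.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Same shape as the nc test: prints nothing when the connection works.
# port_open se01 6819 || echo Connection not working
```

Running it once from the ctld host and once on the dbd host itself helps 
separate a firewall problem from a daemon that is simply not listening.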





On Mon, 5 Apr 2021 at 21:00, Ioannis Botsis wrote:




Hi Sean,


Thank you for your prompt response. I made the changes you suggested, but 
slurmctld still refuses to run. Please find attached the new slurmctld -Dvvvv 
output.






From: slurm-users On Behalf Of Sean Crosby
Sent: Monday, April 5, 2021 11:46 AM
To: Slurm User Community List
Subject: Re: [slurm-users] [EXT] slurmctld error


Hi Jb,


You have set AccountingStoragePort to 3306 in slurm.conf, which is the MySQL 
port running on the DBD host.


AccountingStoragePort is the port for the Slurmdbd service, and not for MySQL.


Change AccountingStoragePort to 6819 and it should fix your issues.
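
A sketch of a quick sanity check for the mix-up described above. The helper 
name check_acct_port is hypothetical and the sample file is a stand-in for your 
own slurm.conf; 6819 is the slurmdbd port and 3306 the MySQL port, as stated 
above.

```shell
# Hypothetical helper: warn when AccountingStoragePort in a slurm.conf-style
# file points at MySQL (3306) instead of slurmdbd.
check_acct_port() {
    # Take the last uncommented AccountingStoragePort= line in the file.
    port=$(sed -n 's/^[[:space:]]*AccountingStoragePort=\([0-9]*\).*/\1/p' "$1" | tail -n 1)
    case "$port" in
        "")   echo "AccountingStoragePort not set" ;;
        3306) echo "port 3306 is MySQL; slurmdbd is expected on 6819"; return 1 ;;
        *)    echo "AccountingStoragePort=$port" ;;
    esac
}

# Example with a throwaway file standing in for slurm.conf.
sample=$(mktemp)
printf 'AccountingStoragePort=3306\n' > "$sample"
check_acct_port "$sample" || true
rm -f "$sample"
```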


I also think you should comment out the lines 




You shouldn't need those lines







On Mon, 5 Apr 2021 at 18:03, Ioannis Botsis wrote:




Hello everyone,


I installed Slurm 19.05.5 from the Ubuntu repo for the first time, on a 
cluster with 44 identical nodes, but I have a problem with slurmctld.service.


When I try to activate slurmctld I get the following message:


fatal: You are running with a database but for some reason we have no TRES from 
it.  This should only happen if the database is down and you don't have any 
state files


*       Ubuntu 20.04.2 runs on the server and nodes in the exact same version.
*       munge 0.5.13 installed from Ubuntu repo running on server and nodes.
*       mysql  Ver 8.0.23-0ubuntu0.20.04.1 for Linux on x86_64 ((Ubuntu))  
installed from ubuntu repo running on server.


slurm.conf is the same on all nodes and on server.


slurmd.service is active and running on all nodes without problem.


mysql.service is active and running on server.

slurmdbd.service is active and running on server (slurm_acct_db created).


Please find attached slurm.conf and the detailed output of the 
slurmctld -Dvvvv command.


Any hint?


Thanks in advance






Attachment: slurmd.log
Description: Binary data

Attachment: show_config
Description: Binary data
