I’m pretty sure that you should only need to restart slurmd on the node that was reporting the problem. If the mismatch put the node into a drained state, you may need to manually undrain it using scontrol.
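A minimal sketch of those two steps, assuming a systemd-managed slurmd and that node003 is the affected node:

```shell
# On node003: restart the node daemon so it re-reads the hardware config
systemctl restart slurmd

# From the head node (or anywhere with scontrol access): if the node was
# drained by the configuration mismatch, return it to service
scontrol update NodeName=node003 State=RESUME

# Confirm the node's state afterwards
sinfo -N -n node003
```

This is an operational command fragment rather than a script; `State=RESUME` clears a DRAIN set by the slurmd error without cancelling running jobs.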
Testing job performance is not the job of the scheduler; it just schedules the jobs that you tell it to. You’ll need to run those tests yourself.

Mike

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Robert Kudyba <rkud...@fordham.edu>
Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
Date: Thursday, April 23, 2020 at 12:55
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] [External] slurmd: error: Node configuration differs from hardware: CPUs=24:48(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw)

On Thu, Apr 23, 2020 at 1:43 PM Michael Robbert <mrobb...@mines.edu> wrote:

    It looks like you have hyper-threading turned on, but haven’t defined ThreadsPerCore=2. You either need to turn off hyper-threading in the BIOS or change the definition of ThreadsPerCore in slurm.conf.

Nice find. node003 has hyper-threading enabled but node001 and node002 do not:

[root@node001 ~]# dmidecode -t processor | grep -E '(Core Count|Thread Count)'
        Core Count: 12
        Thread Count: 12
        Core Count: 12
        Thread Count: 12

[root@node003 ~]# dmidecode -t processor | grep -E '(Core Count|Thread Count)'
        Core Count: 12
        Thread Count: 24
        Core Count: 12

I found a great mini script to disable hyper-threading without a reboot. I did get the following warning, but I don't think it's a big issue:

WARNING, didn't collect load info for all cpus, balancing is broken

Do I have to restart slurmctld on the head node and/or slurmd on node003?

Side question: are there ways with Slurm to test if hyper-threading improves performance and job speed?
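The original "mini script" isn't shown, but the usual no-reboot approach is to offline each hyperthread sibling through sysfs. A hedged sketch (assumes the standard Linux sysfs CPU topology layout and root privileges; `first_sibling` is a helper name introduced here, not from the thread):

```shell
#!/bin/sh
# Offline hyperthread siblings without a reboot (run as root on the node).
# thread_siblings_list may be comma-separated ("0,24") or a range ("0-1");
# the first id listed is treated as the primary thread of the core.

# first_sibling: return the first CPU id from a thread_siblings_list value
first_sibling() {
    printf '%s' "$1" | sed 's/[-,].*//'
}

for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    id=${cpu##*cpu}
    list=$(cat "$cpu/topology/thread_siblings_list" 2>/dev/null) || continue
    # If this CPU is not the first thread of its core, it is an HT sibling
    if [ "$id" != "$(first_sibling "$list")" ] && [ -w "$cpu/online" ]; then
        echo 0 > "$cpu/online"   # take the sibling thread offline
    fi
done
```

Note this only changes the running kernel; the BIOS setting (or the slurm.conf `ThreadsPerCore` definition) is still what should be made consistent across nodes, since sysfs changes don't survive a reboot.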
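On the side question: Slurm won't benchmark for you, but once node003 has `ThreadsPerCore=2` in slurm.conf you can A/B the same job with and without hyperthread placement using the real `--hint` option. A sketch, where `./mybench` stands in for whatever workload you want to time (hypothetical name, not from the thread):

```shell
# Pack tasks onto hyperthread siblings
sbatch -w node003 -n 24 --hint=multithread   --wrap="./mybench"

# One task per physical core, siblings left idle
sbatch -w node003 -n 24 --hint=nomultithread --wrap="./mybench"
```

Comparing the elapsed times of the two runs (e.g. via `sacct -o JobID,Elapsed`) gives a direct answer for that particular workload; HT helps some codes and hurts others, so per-application testing like this is the only reliable check.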