Pierre-Marie,

Here is what I have in slurmd.log on tars-XXX:
-sh-4.1$ sudo cat slurmd.log
2017-10-09T17:09:57.538636+02:00 tars-XXX slurmd[18597]: Message aggregation enabled: WindowMsgs=24, WindowTime=200
2017-10-09T17:09:57.647486+02:00 tars-XXX slurmd[18597]: CPU frequency setting not configured for this node
2017-10-09T17:09:57.647499+02:00 tars-XXX slurmd[18597]: Resource spec: Reserved system memory limit not configured for this node
2017-10-09T17:09:57.808352+02:00 tars-XXX slurmd[18597]: cgroup namespace 'freezer' is now mounted
2017-10-09T17:09:57.844400+02:00 tars-XXX slurmd[18597]: cgroup namespace 'cpuset' is now mounted
2017-10-09T17:09:57.902418+02:00 tars-XXX slurmd[18640]: slurmd version 16.05.9 started
2017-10-09T17:09:57.957030+02:00 tars-XXX slurmd[18640]: slurmd started on Mon, 09 Oct 2017 17:09:57 +0200
2017-10-09T17:09:57.957336+02:00 tars-XXX slurmd[18640]: CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=258373 TmpDisk=129186 Uptime=74 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

--
Véronique Legrand
IT engineer – scientific calculation & software development
https://research.pasteur.fr/en/member/veronique-legrand/
Cluster and computing group
IT department
Institut Pasteur Paris
Tel : 95 03

From: "Le Biot, Pierre-Marie" <pierre-marie.leb...@hpe.com>
Reply-To: slurm-dev <slurm-dev@schedmd.com>
Date: Tuesday, 10 October 2017 at 15:20
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] RE: Node always going to DRAIN state with reason=Low TmpDisk

Véronique,

This is not what I expected; I was thinking slurmd -C would return TmpDisk=204000, or more probably 129186 as seen in the slurmctld log.

I suppose that you already checked the slurmd logs on tars-XXX?

Regards,
Pierre-Marie Le Biot

From: Véronique LEGRAND [mailto:veronique.legr...@pasteur.fr]
Sent: Tuesday, October 10, 2017 2:09 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] RE: Node always going to DRAIN state with reason=Low TmpDisk

Hello Pierre-Marie,

First, thank you for your hint. I just tried:

>slurmd -C
NodeName=tars-XXX CPUs=12 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=258373 TmpDisk=500
UpTime=0-20:50:54

The value for TmpDisk is erroneous. I do not know what the cause of this can be, since the operating system's df command gives the right values:

-sh-4.1$ df -hl
Filesystem   Size  Used Avail Use% Mounted on
slash_root   3.5G  1.6G  1.9G  47% /
tmpfs        127G     0  127G   0% /dev/shm
tmpfs        500M   84K  500M   1% /tmp
/dev/sda1    200G   33M  200G   1% /local/scratch

Could slurmd be mixing up tmpfs with /local/scratch?

I tried the same thing on another, similar node (tars-XXX-1) and got:

-sh-4.1$ df -hl
Filesystem   Size  Used Avail Use% Mounted on
slash_root   3.5G  1.7G  1.8G  49% /
tmpfs        127G     0  127G   0% /dev/shm
tmpfs        500M  5.7M  495M   2% /tmp
/dev/sda1    200G   33M  200G   1% /local/scratch

and slurmd -C:

NodeName=tars-XXX-1 CPUs=12 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=258373 TmpDisk=500
UpTime=101-21:34:14

So slurmd -C gives exactly the same answer, but this node does not go into DRAIN state; it works perfectly.

Thank you again for your help.

Regards,

Véronique

--
Véronique Legrand
IT engineer – scientific calculation & software development
https://research.pasteur.fr/en/member/veronique-legrand/
Cluster and computing group
IT department
Institut Pasteur Paris
Tel : 95 03
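A quick way to see which filesystem is being sized here (a sketch, assuming slurmd computes TmpDisk from a statfs() of the path named by the TmpFS parameter and falls back to /tmp when it does not pick up slurm.conf):

df -m /tmp /local/scratch              # ~500 MB vs ~204000 MB
scontrol show config | grep -i tmpfs   # what the running daemons use for TmpFS

A TmpDisk of 500 that exactly matches the 500M /tmp mount would suggest the standalone slurmd -C run is not reading TmpFS=/local/scratch from slurm.conf.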
From: "Le Biot, Pierre-Marie" <pierre-marie.leb...@hpe.com>
Reply-To: slurm-dev <slurm-dev@schedmd.com>
Date: Tuesday, 10 October 2017 at 13:53
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] RE: Node always going to DRAIN state with reason=Low TmpDisk

Hi Véronique,

Did you check the result of slurmd -C on tars-XXX?

Regards,
Pierre-Marie Le Biot

From: Véronique LEGRAND [mailto:veronique.legr...@pasteur.fr]
Sent: Tuesday, October 10, 2017 12:02 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Node always going to DRAIN state with reason=Low TmpDisk

Hello,

I have a problem with one node in our cluster. It is configured exactly like all the other nodes (200 GB of temporary storage).

Here is what I have in slurm.conf:

# COMPUTES
TmpFS=/local/scratch
# NODES
GresTypes=disk,gpu
ReturnToService=2
NodeName=DEFAULT State=UNKNOWN Gres=disk:204000,gpu:0 TmpDisk=204000
NodeName=tars-[XXX-YYY] Sockets=2 CoresPerSocket=6 RealMemory=254373 Feature=ram256,cpu,fast,normal,long,specific,admin Weight=20

The node that has the trouble is tars-XXX.

Here is what I have in gres.conf:

# Local disk space in MB (/local/scratch)
NodeName=tars-[ZZZ-UUU] Name=disk Count=204000

XXX is in the range [ZZZ,UUU].

If I ssh to tars-XXX, here is what I get:

-sh-4.1$ df -hl
Filesystem   Size  Used Avail Use% Mounted on
slash_root   3.5G  1.6G  1.9G  47% /
tmpfs        127G     0  127G   0% /dev/shm
tmpfs        500M   84K  500M   1% /tmp
/dev/sda1    200G   33M  200G   1% /local/scratch

/local/scratch is the directory for temporary storage.

The problem is that when I do scontrol show node tars-XXX, I get:

NodeName=tars-XXX Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.00
   AvailableFeatures=ram256,cpu,fast,normal,long,specific,admin
   ActiveFeatures=ram256,cpu,fast,normal,long,specific,admin
   Gres=disk:204000,gpu:0
   NodeAddr=tars-113 NodeHostName=tars-113 Version=16.05
   OS=Linux RealMemory=254373 AllocMem=0 FreeMem=255087 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=204000 Weight=20 Owner=N/A MCS_label=N/A
   BootTime=2017-10-09T17:08:43 SlurmdStartTime=2017-10-09T17:09:57
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low TmpDisk [slurm@2017-10-10T11:25:04]

And in the slurmctld logs, I get the error messages:

2017-10-10T08:35:57+02:00 tars-master slurmctld[120352]: error: Node tars-XXX has low tmp_disk size (129186 < 204000)
2017-10-10T08:35:57+02:00 tars-master slurmctld[120352]: error: _slurm_rpc_node_registration node=tars-XXX: Invalid argument

I tried to reboot tars-XXX yesterday but the problem is still there. I also tried:

scontrol update NodeName=ClusterNode0 State=Resume

but the state went back to DRAIN after a while…

Does anyone have an idea of what could cause the problem? My configuration files seem correct, and there really are 200 GB free in /local/scratch on tars-XXX…

Thank you in advance for any help.

Regards,

Véronique

--
Véronique Legrand
IT engineer – scientific calculation & software development
https://research.pasteur.fr/en/member/veronique-legrand/
Cluster and computing group
IT department
Institut Pasteur Paris
Tel : 95 03
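For reference, a minimal re-check sequence once TmpDisk detection on the node is sorted out (a sketch only; tars-XXX is the placeholder node name from this thread and the slurm.conf path is an assumption, adjust both to the actual install):

grep -i '^TmpFS' /etc/slurm/slurm.conf          # confirm TmpFS=/local/scratch is what the compute node sees
slurmd -C                                       # should now report TmpDisk close to 204000
scontrol update NodeName=tars-XXX State=RESUME
scontrol show node tars-XXX | grep -E 'State=|TmpDisk='

If the node still registers with less than the configured TmpDisk=204000, slurmctld will put it back into DRAIN with Reason=Low TmpDisk, as seen above.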