Hi,
see the man page for slurm.conf:
TmpFS
Fully qualified pathname of the file system available to user jobs for
temporary storage. This parameter is used in
establishing a node's TmpDisk space. The default value is "/tmp".
So it is using /tmp. You need to change that parameter to /local/scratch and
then it should work.
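For example (assuming the same slurm.conf is deployed on the compute node and the controller):

    TmpFS=/local/scratch

followed by "scontrol reconfigure" (or a restart of slurmd on the node) so the new value is picked up. You can check what the running daemons actually see with:

    scontrol show config | grep TmpFS
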
Regards,
Uwe
On 10.10.2017 at 14:09, Véronique LEGRAND wrote:
> Hello Pierre-Marie,
>
> First, thank you for your hint.
>
> I just tried.
>
> >slurmd -C
> NodeName=tars-XXX CPUs=12 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=258373 TmpDisk=500
> UpTime=0-20:50:54
>
> The value reported for TmpDisk is wrong. I do not know what the cause could be, since the operating system's df command gives the right values.
>
> -sh-4.1$ df -hl
> Filesystem    Size  Used Avail Use% Mounted on
> slash_root    3.5G  1.6G  1.9G  47% /
> tmpfs         127G     0  127G   0% /dev/shm
> tmpfs         500M   84K  500M   1% /tmp
> /dev/sda1     200G   33M  200G   1% /local/scratch
>
> Could slurmd be confusing one of the tmpfs mounts with /local/scratch?
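>
> (A quick way to check, since TmpDisk is expressed in megabytes, would be to compare the candidate mounts in the same unit:
>
>     df -BM /tmp /dev/shm /local/scratch
>
> and see which size the registered value matches.)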
>
> I tried the same thing on another similar node (tars-XXX-1).
>
> I got:
>
> -sh-4.1$ df -hl
> Filesystem    Size  Used Avail Use% Mounted on
> slash_root    3.5G  1.7G  1.8G  49% /
> tmpfs         127G     0  127G   0% /dev/shm
> tmpfs         500M  5.7M  495M   2% /tmp
> /dev/sda1     200G   33M  200G   1% /local/scratch
>
> and
>
> slurmd -C
> NodeName=tars-XXX-1 CPUs=12 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=258373 TmpDisk=500
> UpTime=101-21:34:14
>
> So, slurmd -C gives exactly the same answer, but this node doesn't go into DRAIN state; it works perfectly.
>
> Thank you again for your help.
>
> Regards,
>
> Véronique
>
> --
> Véronique Legrand
> IT engineer – scientific calculation & software development
> https://research.pasteur.fr/en/member/veronique-legrand/
> Cluster and computing group
> IT department
> Institut Pasteur Paris
> Tel : 95 03
>
> *From: *"Le Biot, Pierre-Marie" <[email protected]>
> *Reply-To: *slurm-dev <[email protected]>
> *Date: *Tuesday, 10 October 2017 at 13:53
> *To: *slurm-dev <[email protected]>
> *Subject: *[slurm-dev] RE: Node always going to DRAIN state with reason=Low
> TmpDisk
>
>
>
> Hi Véronique,
>
> Did you check the result of slurmd -C on tars-XXX?
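>
> (slurmd -C prints the node's configuration as detected on the machine itself, including the TmpDisk value it would report to the controller, so it shows what slurmd thinks the temporary storage size is.)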
>
> Regards,
> Pierre-Marie Le Biot
>
> *From:* Véronique LEGRAND [mailto:[email protected]]
> *Sent:* Tuesday, October 10, 2017 12:02 PM
> *To:* slurm-dev <[email protected]>
> *Subject:* [slurm-dev] Node always going to DRAIN state with reason=Low TmpDisk
>
> Hello,
>
> I have a problem with one node in our cluster. It is configured exactly like all the other nodes (200 GB of temporary storage).
>
> Here is what I have in slurm.conf:
>
> # COMPUTES
> TmpFS=/local/scratch
>
> # NODES
> GresTypes=disk,gpu
> ReturnToService=2
> NodeName=DEFAULT State=UNKNOWN Gres=disk:204000,gpu:0 TmpDisk=204000
> NodeName=tars-[XXX-YYY] Sockets=2 CoresPerSocket=6 RealMemory=254373 Feature=ram256,cpu,fast,normal,long,specific,admin Weight=20
>
> The node that has the trouble is tars-XXX.
>
> Here is what I have in gres.conf:
>
> # Local disk space in MB (/local/scratch)
> NodeName=tars-[ZZZ-UUU] Name=disk Count=204000
>
> XXX is in range [ZZZ,UUU].
>
> If I ssh to tars-XXX, here is what I get:
>
> -sh-4.1$ df -hl
> Filesystem    Size  Used Avail Use% Mounted on
> slash_root    3.5G  1.6G  1.9G  47% /
> tmpfs         127G     0  127G   0% /dev/shm
> tmpfs         500M   84K  500M   1% /tmp
> /dev/sda1     200G   33M  200G   1% /local/scratch
>
> /local/scratch is the directory for temporary storage.
>
> The problem is that when I do
>
> scontrol show node tars-XXX
>
> I get:
>
> NodeName=tars-XXX Arch=x86_64 CoresPerSocket=6
>    CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.00
>    AvailableFeatures=ram256,cpu,fast,normal,long,specific,admin
>    ActiveFeatures=ram256,cpu,fast,normal,long,specific,admin
>    Gres=disk:204000,gpu:0
>    NodeAddr=tars-113 NodeHostName=tars-113 Version=16.05
>    OS=Linux RealMemory=254373 AllocMem=0 FreeMem=255087 Sockets=2 Boards=1
>    State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=204000 Weight=20 Owner=N/A MCS_label=N/A
>    BootTime=2017-10-09T17:08:43 SlurmdStartTime=2017-10-09T17:09:57
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>    Reason=Low TmpDisk [slurm@2017-10-10T11:25:04]
>
> And in the slurmctld logs, I get these error messages:
>
> 2017-10-10T08:35:57+02:00 tars-master slurmctld[120352]: error: Node tars-XXX has low tmp_disk size (129186 < 204000)
> 2017-10-10T08:35:57+02:00 tars-master slurmctld[120352]: error: _slurm_rpc_node_registration node=tars-XXX: Invalid argument
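>
> (For what it's worth, 129186 MB is roughly 126 GB, which is close to the 127G tmpfs sizes above and nowhere near the 200G on /local/scratch, so the size being registered looks like a tmpfs size rather than the scratch disk.)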
>
> I tried to reboot tars-XXX yesterday but the problem is still there.
>
> I also tried:
>
> scontrol update NodeName=ClusterNode0 State=Resume
>
> but the state went back to DRAIN after a while…
>
> Does anyone have an idea of what could be causing the problem? My configuration files seem correct, and there really are 200G free in /local/scratch on tars-XXX…
>
> Thank you in advance for any help.
>
> Regards,
>
> Véronique
>
> --
> Véronique Legrand
> IT engineer – scientific calculation & software development
> https://research.pasteur.fr/en/member/veronique-legrand/
> Cluster and computing group
> IT department
> Institut Pasteur Paris
> Tel : 95 03