Make free tmpfs space a GRES (and thus a trackable TRES), and have NHC update it, as in:

scontrol update nodename=... gres=tmpfree:$(stat -f -c "%f*%S" /tmp | bc)

Replace /tmp with your tmpfs mount.


You'll have to define that GRES in slurm.conf and gres.conf as usual (start with Count=1 and have NHC update it).
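
A minimal sketch of what that might look like (node names and the initial count are placeholders; check the exact gres.conf syntax against your Slurm version):

   # slurm.conf
   GresTypes=tmpfree
   NodeName=node[001-064] Gres=tmpfree:1 ...

   # gres.conf on the compute nodes
   Name=tmpfree Count=1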


Do note that this is a simplistic example: updating like that will overwrite any other GRES defined for the node. You might therefore wish to create an 'updategres' function that first reads the node's current GRES string, modifies only the count of the field you want to change, and writes back a complete GRES string (see the sketch below).
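
Something along these lines might do, as an untested sketch (the parsing of the scontrol output is an assumption; verify it against your Slurm version):

   updategres() {
       local node=$1 name=$2 count=$3
       local current new
       # Current GRES string as slurmctld sees it, e.g. "gpu:2,tmpfree:12345"
       current=$(scontrol show node -o "$node" | tr ' ' '\n' | grep '^Gres=' | cut -d= -f2)
       [ "$current" = "(null)" ] && current=""
       if echo "$current" | grep -q "${name}:"; then
           # Only touch the count of the field we care about
           new=$(echo "$current" | sed "s/${name}:[^,]*/${name}:${count}/")
       else
           # Append it if it isn't defined yet
           new="${current:+${current},}${name}:${count}"
       fi
       scontrol update NodeName="$node" Gres="$new"
   }

   # e.g. called from an NHC check or a cron job on the node itself:
   updategres "$(hostname -s)" tmpfree "$(stat -f -c '%f*%S' /tmp | bc)"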


In sbatch do:

sbatch --gres=tmpfree:20G

Based on the last update from NHC, the scheduler should then only consider nodes with enough tmpfree for the job.
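
To check what the scheduler currently sees, standard commands like these should suffice:

   sinfo -N -o "%N %G"                           # per-node GRES as slurmctld sees it
   scontrol show node <nodename> | grep -i gres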


HTH

--Dani_L.


On 9/10/19 10:15 PM, Ole Holm Nielsen wrote:
Hi Michael,

Thanks for the suggestion!  We have user requests for certain types of jobs (quantum chemistry) that require fairly large local scratch space. Our jobs normally do not have this requirement.  So unfortunately the per-node NHC check doesn't seem to do the trick.  (We already have an NHC check "check_fs_used /scratch 90%").

Best regards,
Ole


On 10-09-2019 20:41, Michael Jennings wrote:
On Monday, 02 September 2019, at 20:02:57 (+0200),
Ole Holm Nielsen wrote:

We have some users requesting that a certain minimum size of the
*Available* (i.e., free) TmpFS disk space should be present on nodes
before a job should be considered by the scheduler for a set of
nodes.

I believe that the "sbatch --tmp=size" option merely refers to the
TmpFS file system *Size* as configured in slurm.conf, and this is
*not* what users need.

For example, a job might require 50 GB of *Available disk space* on
the TmpFS file system, which may however have only 20 GB out of 100
GB *Available* as shown by the df command, the rest having been
consumed by other jobs (present or past).

However, when we do "scontrol show node <nodename>", only the TmpFS
file system *Size* is displayed as a "TmpDisk" number, but not the
*Available* number.

Question: How can we get slurmd to report back to the scheduler the
amount of *Available* disk space?  And how can users specify the
minimum *Available* disk space required by their jobs submitted by
"sbatch"?

If this is not feasible, are there other techniques that achieve the
same goal?  We're currently still at Slurm 18.08.

Hi, Ole!

I'm assuming you are wanting a per-job resolution on this rather than
per-node?  If per-node is good enough, you can of course use NHC to
check this, e.g.:
   * || check_fs_free /tmp 50GB

That doesn't work per-job, though, obviously.  Something that might
work, however, as a temporary work-around for this might be to have
the user run a single NHC command, like this:
   srun --prolog='nhc -e "check_fs_free /tmp 50GB"'

There might be some tweaks/caveats to this since NHC normally runs as
root, but just an idea....  :-)  An even crazier idea would be to set
NHC_LOAD_ONLY=1 in the environment, source /usr/sbin/nhc, and then
execute the shell function `check_fs_free` directly!  :-D
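
As a very rough sketch of that last idea (exact paths, and whether the check works without NHC's normal setup, will depend on your NHC installation):

   export NHC_LOAD_ONLY=1
   . /usr/sbin/nhc
   check_fs_free /tmp 50GB || echo "not enough free space in /tmp"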
