Hi Slurm Users,

first time posting. I have a new Slurm setup where users can specify an amount of local node disk space they wish to use. This is a "gres" resource named "local", measured in GB. Once a user's job is scheduled and starts executing, the node prolog creates a folder for the job on the node and adds an XFS project quota for it, with the requested amount as the soft limit and +5% as the hard limit. The user prolog then sets this folder as the job's $TMPDIR. Finally, the node epilog removes the quota and the folder on job completion.
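
For context, here is a boiled-down sketch of the prolog side (not my exact script; the /local mountpoint, reusing the job ID as the XFS project ID, and the hard-coded requested size are stand-ins):

    #!/usr/bin/env python3
    # Node prolog sketch: create the job folder and put an XFS project
    # quota on it. Assumes the local filesystem is mounted at /local and
    # that the job ID doubles as the XFS project ID.
    import os
    import subprocess

    MOUNT = "/local"                  # assumed XFS mountpoint
    job_id = os.environ["SLURM_JOB_ID"]
    job_dir = os.path.join(MOUNT, job_id)
    # Placeholder: in practice the requested gres:local amount is parsed
    # from the job record, e.g. via "scontrol show job $SLURM_JOB_ID".
    requested_gb = 100

    os.makedirs(job_dir)
    subprocess.run(
        ["xfs_quota", "-x",
         # register the folder as XFS project <job_id> ...
         "-c", f"project -s -p {job_dir} {job_id}",
         # ... and set the requested amount as soft, +5% as hard limit
         "-c", f"limit -p bsoft={requested_gb}g "
               f"bhard={int(requested_gb * 1.05)}g {job_id}",
         MOUNT],
        check=True)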

This all works great so far. Now I have been working on an email script that would notify users when their "local" allocation is used up. Since Slurm itself has no idea what gres:local actually is and only manages it as a number, I have to do this myself. My plan was to check the quota in the node epilog on job termination to see how much was actually used, but I've now run into a snag: how do I get this information to the MailProg configured in slurm.conf?
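
For illustration, the epilog-side check looks roughly like this (again a sketch; the column layout assumed for the xfs_quota output is the usual one-line "quota -p -N -b" report):

    #!/usr/bin/env python3
    # Node epilog sketch: read the project's usage before tearing down
    # the quota and the folder. Same assumptions as the prolog sketch.
    import os
    import shutil
    import subprocess

    MOUNT = "/local"
    job_id = os.environ["SLURM_JOB_ID"]
    job_dir = os.path.join(MOUNT, job_id)

    # One line per project: <device> <used> <soft> <hard> ... (1K blocks)
    out = subprocess.run(
        ["xfs_quota", "-x", "-c", f"quota -p -N -b {job_id}", MOUNT],
        capture_output=True, text=True, check=True).stdout
    fields = out.split()
    used_kb, soft_kb = int(fields[1]), int(fields[2])
    pct_used = 100 * used_kb / soft_kb if soft_kb else 0

    # Tear down: drop the limits, then delete the folder.
    subprocess.run(
        ["xfs_quota", "-x", "-c", f"limit -p bsoft=0 bhard=0 {job_id}",
         MOUNT], check=True)
    shutil.rmtree(job_dir)
    # pct_used now has to reach the MailProg somehow, which is the snag.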

The arguments to that program always appear to be of this form:
-s SLURM Job_id=327 Name=ddt_clone Ended, Run time 00:05:01, COMPLETED, ExitCode 0

and the script's environment contains only the cluster name, nothing else.

The question now becomes: how do I get the quota status at the end of the job from the node epilog to the MailProg running on the head node? I can parse the job ID from the argument line passed to the script and thus get all the job's information via scontrol. So my first thought was that if I could add my own data field to that scontrol output, it would solve my problem. Unfortunately, I can't seem to find such an option.
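
Parsing the job ID is the easy part; a MailProg wrapper along these lines (a sketch, assuming the argument form shown above) already gets me everything scontrol knows about the job, just not my quota figure:

    #!/usr/bin/env python3
    # MailProg wrapper sketch: extract the job ID from the subject line
    # Slurm passes via -s, then look the job up with scontrol.
    import re
    import subprocess
    import sys

    if "-s" in sys.argv:
        subject = sys.argv[sys.argv.index("-s") + 1]
        m = re.search(r"Job_id=(\d+)", subject)
        if m:
            job_id = m.group(1)
            job_info = subprocess.run(
                ["scontrol", "show", "job", job_id],
                capture_output=True, text=True).stdout
            # job_info holds everything scontrol reports, but there is
            # no custom field to carry the quota usage in.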

Other than that, I've only come up with writing some sort of file to a shared storage mount that the MailProg could then read.
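
That would look something like this (a sketch; the /shared/quota-reports path is made up, and any mount visible from both the compute nodes and the head node would do):

    #!/usr/bin/env python3
    # Shared-file handoff sketch: the node epilog drops one small file
    # per job, the MailProg wrapper picks it up by job ID and deletes it.
    import os

    REPORT_DIR = "/shared/quota-reports"    # hypothetical shared path

    def write_report(job_id, pct_used):
        # node epilog side, after reading the quota
        with open(os.path.join(REPORT_DIR, job_id), "w") as f:
            f.write(f"{pct_used:.1f}\n")

    def read_report(job_id):
        # MailProg wrapper side; returns None if the epilog wrote nothing
        path = os.path.join(REPORT_DIR, job_id)
        if not os.path.exists(path):
            return None
        with open(path) as f:
            pct_used = float(f.read())
        os.unlink(path)  # remove right away so nothing goes stale
        return pct_used

Deleting the file after reading at least keeps the share from filling up with stale reports, but it still feels like a workaround.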

Can you think of a more elegant way to attach this information to the job, so that the MailProg on the head node can access it via the job ID?

Any help is appreciated!
