Hi all,

We need to detect certain problems at the point a job ends, so we wrote a detection script that runs in the Slurm epilog and should drain the node if a check fails. I know that exiting the epilog with a non-zero code makes Slurm drain the node automatically. But in that case the drain reason is always set to "Epilog error", so our auto-repair program cannot tell how to repair the node.

The other way is to call scontrol directly from the epilog to drain the node, but the official documentation at https://slurm.schedmd.com/prolog_epilog.html says: "Prolog and Epilog scripts should be designed to be as short as possible and should not call Slurm commands (e.g. squeue, scontrol, sacctmgr, etc). Slurm commands in these scripts can potentially lead to performance issues and should not be used."

So what is the best way to drain a node from the epilog with a self-defined reason, or to make Slurm record a more detailed message than "Epilog error" as the reason?
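For reference, here is a rough sketch of the two options described above, so it is clear what I am comparing. It is not our real script: `run_health_check`, `HEALTH_MARKER`, and the `SCONTROL` override are placeholders made up for illustration. `SLURMD_NODENAME` is the node name variable that Slurm sets in the epilog environment.

```shell
#!/bin/bash
# Sketch only: run_health_check, HEALTH_MARKER and SCONTROL are
# hypothetical names, not part of Slurm.

# Placeholder check: the node counts as unhealthy if a marker file exists.
run_health_check() {
    [ ! -e "${HEALTH_MARKER:-/var/run/node_unhealthy}" ]
}

# Option 1: return non-zero on failure. If the epilog exits with this
# status, Slurm drains the node, but the reason is always "Epilog error".
epilog_option_exit_code() {
    run_health_check || return 1
}

# Option 2: drain explicitly with a custom reason via scontrol.
# This is the pattern the documentation advises against.
epilog_option_scontrol() {
    local reason="$1"
    if ! run_health_check; then
        "${SCONTROL:-scontrol}" update NodeName="${SLURMD_NODENAME}" \
            State=DRAIN Reason="${reason}"
    fi
    return 0
}
```

With option 1 the epilog would end with something like `epilog_option_exit_code || exit 1`; with option 2 it would call `epilog_option_scontrol "gpu check failed"` and exit 0. Option 2 gives us the reason string we want but goes against the documentation quoted above.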