On Monday, 23 April 2018 11:58:56 PM AEST Paul Edmon wrote:

> I would recommend putting a clean up process in your epilog script.

Instead of that I'd recommend using cgroups to constrain processes to the
resources they have requested; it has the useful side effect of being able
to track all children of the job on that node. The one way some things
escape is if they SSH into other nodes; to stop that, use pam_slurm_adopt
to capture those processes into the "extern" cgroup.

More on using pam_slurm_adopt here:

https://slurm.schedmd.com/pam_slurm_adopt.html

> We have a check here that sees if the job completed and if so it then
> terminates all the user processes by kill -9 to clean up any residuals.

That can be dangerous if you permit jobs to share nodes (which is pretty
standard down here in Australia) as you could end up killing processes from
other jobs on that same node.

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
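
For reference, a minimal sketch of the settings the cgroup and pam_slurm_adopt approach involves. None of this comes from the thread itself: the file locations and the choice of which resources to constrain are assumptions for illustration, so check them against the slurm.conf, cgroup.conf and pam_slurm_adopt documentation for your Slurm version before using them.

```conf
# slurm.conf (location assumed: /etc/slurm/slurm.conf)
ProctrackType=proctrack/cgroup   # track every process of a job via cgroups
TaskPlugin=task/cgroup           # confine tasks to their allocated resources
PrologFlags=contain              # create the "extern" step that pam_slurm_adopt adopts into

# cgroup.conf (same directory as slurm.conf)
ConstrainCores=yes               # limit jobs to the cores they requested
ConstrainRAMSpace=yes            # limit jobs to the memory they requested

# /etc/pam.d/sshd on the compute nodes
account    required    pam_slurm_adopt.so
```

With process tracking done through cgroups, Slurm can clean up what belongs to a finished job without touching that user's other jobs on a shared node.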