John/Chris,

Thanks for your advice. I'll need to do some reading on cgroups; I've never been exposed to that concept, and I don't even know whether the SLURM setup I have access to has the cgroup or PAM plugins/modules enabled or available. Unfortunately I'm not involved in administering SLURM. I'm simply a user of a much larger, already-established system, where other users run compute tasks completely separate from my use case. I'm therefore most interested in solutions I can implement without sys admin support on the SLURM side, which is why I started looking at the --epilog route.
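If I understand the suggestions correctly, I can at least check from user space whether cgroup tracking is configured at all, without bothering the admins. Something along these lines is what I have in mind (just my guess at the relevant config parameters; scontrol show config shouldn't need any special privileges):

    # Does the site config mention cgroups at all?
    scontrol show config | egrep -i 'ProctrackType|TaskPlugin'

    # From inside an allocation, see which cgroups my step landed in
    srun --nodes 1 --pty bash
    cat /proc/self/cgroup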
I neither have administrator access to SLURM nor the time to consider more complex approaches that might hack the Trick architecture. *Literally the only thing that isn't working for me right now is the cleanup mechanism*; everything else is working just fine. It's not as simple as killing all the simulation-spawned processes: the processes themselves create message queues for internal communication that live in /dev/mqueue/ on each node, and when the sim gets a kill -9 there's no internal cleanup, so those files linger on the filesystem indefinitely and cause issues in subsequent runs on those machines.

From my understanding, there's already a "master" epilog script in our system that kills all user processes after a user's job completes. Our SLURM nodes are set up to be "reserved" for the user requesting them, so that greedy cleanup script isn't a problem for other compute processes; the nodes are reserved for that single person. I might just ping the administrators and ask them to also add an 'rm /dev/mqueue/*' to that script; to me that seems like the fastest solution given what I know.

I would prefer to keep that part in "user space" since it's very specific to my use case, but srun --epilog is not behaving as I would expect. Can y'all confirm that what I'm seeing is indeed what is expected to happen?

    ssh:     ssh machine001
    srun:    srun --nodes 3 --epilog cleanup.sh myProgram.exe
    squeue:  shows job 123 running on machine200, machine201, machine202
    kill:    scancel 123
    Result:  myProgram.exe is terminated, cleanup.sh runs on machine001

I was expecting cleanup.sh to run on one (or all) of the compute nodes (200-202), not on the machine I launched the srun command from (001).
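For what it's worth, here is roughly what I have in mind for cleanup.sh - just a sketch on my part, assuming the leftover queues under /dev/mqueue are all owned by my account (they should be, since the nodes are reserved for a single user anyway):

    #!/bin/bash
    # cleanup.sh - remove POSIX message queues my sim left behind on this node.
    # Only touches entries owned by the calling user, so it shouldn't step on
    # anything else even if a node ever were shared.
    find /dev/mqueue -mindepth 1 -maxdepth 1 -user "$USER" -exec rm -f {} +

Whether that ends up in the admins' master epilog or in something I launch myself, the contents would be about the same; my problem is just getting it to run on machines 200-202 instead of 001.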
John -- yes, we are heavily invested in the Trick framework and use its Monte Carlo feature quite extensively. In the past we've used PBS to manage our compute nodes, but this is our first attempt to integrate Trick Monte Carlo with SLURM. We do spacecraft simulation and analysis for various projects.

On Mon, Mar 5, 2018 at 12:36 AM, John Hearns <hear...@googlemail.com> wrote:

> Dan, completely off topic here. May I ask what type of simulations you are running?
> Clearly you probably have a large investment of time in Trick.
> However, as a fan of the Julia language, let me leave this link here:
> https://juliaobserver.com/packages/RigidBodyDynamics
>
> On 5 March 2018 at 07:31, John Hearns <hear...@googlemail.com> wrote:
>
>> I completely agree with what Chris says regarding cgroups. Implement them, and you will not regret it.
>>
>> I have worked with other simulation frameworks which work in a similar fashion to Trick, i.e. a master process which spawns off independent worker processes on compute nodes. I am thinking of an internal application we have and, dare I say it, Matlab.
>>
>> In the Trick documentation, Notes <https://github.com/nasa/trick/wiki/UserGuide-Monte-Carlo#notes>:
>>
>> 1. SSH <https://en.wikipedia.org/wiki/Secure_Shell> is used to launch slaves across the network
>> 2. Each slave machine will work in parallel with other slaves, greatly reducing the computation time of a simulation
>>
>> However, I must say there must be plenty of folks at NASA who use this simulation framework on HPC clusters with batch systems. It would surprise me if there were not 'adaptation layers' available for Slurm, SGE, PBS etc.
>> So in Slurm, you would do an sbatch which would reserve your worker nodes, then run a series of sruns which run the worker processes.
>>
>> (I hope I have that round the right way - I seem to recall doing srun then a series of sbatches in the past.)
>>
>> But looking at the Trick wiki quickly, I am wrong. It does seem to work on the model of "get a list of hosts allocated by your batch system", i.e. the SLURM_JOB_HOSTLIST, and then Trick will set up simulation queues which spawn off models using ssh.
>> Looking at the Advanced Topics guide this does seem to be so:
>> https://github.com/nasa/trick/blob/master/share/doc/trick/Trick_Advanced_Topics.pdf
>> The model is that you allocate up to 16 remote worker hosts for a long time. Then various modelling tasks are started on those hosts via ssh. Trick expects those hosts to be available for more tasks during your simulation session.
>> There is also discussion there about turning off irqbalance and cpuspeed, and disabling unnecessary system services.
>>
>> As someone who has spent endless oodles of hours either killing orphaned processes on nodes, or seeing rogue-process alarms, or running ps --forest to trace connections into batch job nodes which bypass the pbs/slurm daemons, I despair slightly...
>> I am probably very wrong, and NASA have excellent Slurm integration.
>>
>> So I agree with Chris - implement cgroups, and try to make sure your ssh 'lands' in a cgroup.
>> 'lscgroup' is a nice command to see what cgroups are active on a compute node.
>> Also, run an interactive job, ssh into one of your allocated worker nodes, then cat /proc/self/cgroup shows which cgroups you have landed in.
>>
>> On 5 March 2018 at 02:20, Christopher Samuel <ch...@csamuel.org> wrote:
>>
>>> On 05/03/18 12:12, Dan Jordan wrote:
>>>
>>>> What is the /correct/ way to clean up processes across the nodes given to my program by SLURM_JOB_NODELIST?
>>>
>>> I'd strongly suggest using cgroups in your Slurm config to ensure that processes are corralled and tracked correctly.
>>>
>>> You can use pam_slurm_adopt from the contrib directory to capture inbound SSH sessions into a running job on the node (and deny access to people who don't).
>>>
>>> Then Slurm should take care of everything for you without needing an epilog.
>>>
>>> Hope this helps!
>>> Chris

--
Dan Jordan