On Thu, Oct 06, 2016 at 12:47:49PM +0100, Mark Dixon wrote:
> On Wed, 5 Oct 2016, William Hay wrote:
> ...
> >Our prolog and epilog (parallel) ssh into the slave nodes and do the
> >equivalent of run-parts on directories full of scripts some of which check
> >if they are running on the head node of the job before doing anything. If
> >we did want the epilog to save TMPDIRS from slave nodes we'd just have to
> >decide how to name them I guess.
> ...
>
> Presumably this would work for you capture-wise because you're creating your
> own TMPDIRs rather than using the ones provided by the execd. (As Reuti
> pointed out, the execd TMPDIRs on slave nodes are ephemeral.)
>
> It'd be a pity to switch to doing it that way: the execd TMPDIR can be
> paired with an xfs project quota scheme which is nice and tidy. I imagine
> that deleting TMPDIRs via an epilog has a greater number of failure modes,
> not all of which can be avoided by purging old directories at boot, like
> intermittent network problems. How has that worked for you in practice?

Pretty well. The epilog is augmented by a load sensor that checks for
TMPDIRs that aren't associated with a job on the node, raises an alarm
and attempts a cleanup. Doesn't fire very often.

>
> Also, passwordless ssh between compute nodes has been useful to avoid. Not
> only principle of least privilege - it's handy to help identify applications
> that aren't tightly integrated.

Our prolog/epilog don't run as the user and the port 22 sshd restricts who
can log in (with or without password). We also use the real ssh as a
wrapper around qrsh:
https://github.com/UCL-RITS/GridEngine-OpenSSH
Which means it is really hard for a code to avoid being tightly integrated.
The prolog/epilog invoke ssh with -o ProxyCommand=none.

William
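
For anyone wanting to try the orphan-TMPDIR check mentioned above, here is a
rough sketch of the idea in Python. It is not our actual load sensor. It
assumes TMPDIRs live directly under one root and are named
<jobid>.<taskid>.<queue>, and that the caller already has the list of job ids
active on the node. The function names and the "orphan_tmpdirs" load value
name are made up for illustration, and the sketch only reports; it does not
delete anything.

```python
import os
import re

# Assumed naming scheme for per-job TMPDIRs: <jobid>.<taskid>.<queue>,
# e.g. "12345.1.all.q". Adjust the pattern if your site names them differently.
TMPDIR_PATTERN = re.compile(r"^(\d+)\.\d+\..+$")


def find_orphan_tmpdirs(tmp_root, active_job_ids):
    """Return sorted paths of job-style directories under tmp_root
    whose job id is not in active_job_ids."""
    active = {str(job_id) for job_id in active_job_ids}
    orphans = []
    for entry in os.listdir(tmp_root):
        match = TMPDIR_PATTERN.match(entry)
        if match is None:
            continue  # not a job TMPDIR, leave it alone
        path = os.path.join(tmp_root, entry)
        # Only real directories count; symlinks are skipped deliberately.
        if os.path.islink(path) or not os.path.isdir(path):
            continue
        if match.group(1) not in active:
            orphans.append(path)
    return sorted(orphans)


def load_report(hostname, orphans, name="orphan_tmpdirs"):
    """Format one load report block (begin / host:name:value / end).
    The load value name is hypothetical."""
    return "\n".join(["begin", f"{hostname}:{name}:{len(orphans)}", "end"])
```

A non-zero value could then be tied to an alarm threshold on the queue, with
any cleanup done as a separate and more cautious step.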
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users