On Thu, Oct 06, 2016 at 12:47:49PM +0100, Mark Dixon wrote:
> On Wed, 5 Oct 2016, William Hay wrote:
> ...
> >Our prolog and epilog (parallel) ssh into the slave nodes and do the
> >equivalent of run-parts on directories full of scripts some of which check
> >if they are running on the head node of the job before doing anything. If
> >we did want the epilog to save TMPDIRS from slave nodes we'd just have to
> >decide how to name them I guess.
> ...
> 
> Presumably this would work for you capture-wise because you're creating your
> own TMPDIRs rather than using the ones provided by the execd. (As Reuti
> pointed out, the execd TMPDIRs on slave nodes are ephemeral.)

> It'd be a pity to switch to doing it that way: the execd TMPDIR can be
> paired with an xfs project quota scheme which is nice and tidy. I imagine
> that deleting TMPDIRs via an epilog has a greater number of failure modes,
> not all of which can be avoided by purging old directories at boot, like
> intermittent network problems. How has that worked for you in practice?

Pretty well.  The epilog is augmented by a load sensor that checks for
TMPDIRs that aren't associated with a job on the node, raises an alarm
and attempts a cleanup.  Doesn't fire very often.
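
For anyone wanting to try something similar, here is a minimal sketch of the idea in Python. It is not our actual sensor. It assumes scratch directories are named "<jobid>.<taskid>.<queue>" (modelled on the execd default), and the complex name "orphan_tmpdirs" and the list_dirs/list_jobs callables are hypothetical placeholders for however you enumerate directories and running jobs on a node. The only part taken from Grid Engine itself is the load sensor protocol: one "begin ... end" report per line read on stdin, exiting on "quit".

```python
import io
import re

# Assumed naming scheme for scratch directories: "<jobid>.<taskid>.<queue>".
JOB_DIR = re.compile(r"^(\d+)\.(\d+)\.(.+)$")


def orphan_tmpdirs(dir_names, active_job_ids):
    """Return scratch directory names whose job id is not active on this node."""
    orphans = []
    for name in dir_names:
        match = JOB_DIR.match(name)
        # Names that do not look like job directories are ignored.
        if match and int(match.group(1)) not in active_job_ids:
            orphans.append(name)
    return sorted(orphans)


def run_sensor(inp, out, host, list_dirs, list_jobs):
    """Emit one load report per input line; stop when "quit" is received."""
    for line in inp:
        if line.strip() == "quit":
            break
        count = len(orphan_tmpdirs(list_dirs(), list_jobs()))
        # "orphan_tmpdirs" is a hypothetical complex name.
        out.write("begin\n%s:orphan_tmpdirs:%d\nend\n" % (host, count))
        out.flush()
```

A real sensor would call run_sensor(sys.stdin, sys.stdout, ...) with callables that list the scratch area and the jobs known to the execd, and would attempt the cleanup itself. The reported value could then drive an alarm threshold on the queue.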
> 
> Also, passwordless ssh between compute nodes has been useful to avoid. Not
> only principle of least privilege - it's handy to help identify applications
> that aren't tightly integrated.

Our prolog/epilog don't run as the user, and the port 22 sshd restricts who can
log in (with or without a password).

We also use the real ssh as a wrapper around qrsh:
https://github.com/UCL-RITS/GridEngine-OpenSSH

This means it is really hard for a code to avoid being tightly integrated.
The prolog/epilog invoke ssh with -o ProxyCommand=none.
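
A rough sketch of that kind of invocation in Python, for illustration only and not our actual scripts. It assumes the job's hosts come from text in PE hostfile format (host name in the first column), and that the per-node work is a literal run-parts over a script directory. The function names and the directory argument are hypothetical, and BatchMode=yes is an extra option added here so a failed login cannot hang waiting for a password.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def parse_pe_hostfile(text):
    """Return the unique host names from PE hostfile text, in first-seen order."""
    hosts = []
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] not in hosts:
            hosts.append(fields[0])
    return hosts


def ssh_command(host, script_dir):
    """Build the ssh argument list, with any configured ProxyCommand disabled."""
    return ["ssh", "-o", "ProxyCommand=none", "-o", "BatchMode=yes",
            host, "run-parts", script_dir]


def run_on_hosts(hosts, script_dir):
    """Run the scripts on every host in parallel; return {host: exit status}."""
    def one(host):
        return host, subprocess.call(ssh_command(host, script_dir))

    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        return dict(pool.map(one, hosts))
```

Passing the options on the command line keeps the override local to the prolog/epilog, so ordinary ssh calls made by jobs still pick up whatever ProxyCommand the system configuration sets.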

William
