On Tue, 14 Feb 2017, William Hay wrote:
...
Our prolog does a parallel ssh(passing through appropriate envvars) into every node assigned to the job and does the equivalent of a run-parts on a directory filled with scripts. Some of these scripts check if they are running on the head node.

(been meaning to reply to this bit for a while, sorry)

For comparison purposes, we achieved a similar result with an extensible starter_method combined with a client-side JSV.

The core starter_method is written in bash and is very basic, but almost everything can be overridden or supplemented (including the bit that actually starts the job script). It does this by reading an environment variable containing a list of shell fragments to source.
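The mechanism described above can be sketched roughly as follows. This is an illustrative reconstruction, not the production script: the variable name `STARTER_FRAGMENTS`, the colon-separated format, and the function names are all assumptions for the sake of the example.

```shell
#!/bin/bash
# Sketch of an extensible starter_method.  All names here are
# illustrative assumptions, not the real production code.
# STARTER_FRAGMENTS: hypothetical colon-separated list of shell
# fragments to source; any fragment may redefine the functions below,
# including the one that actually launches the job script.

start_job() {                       # default launcher; overridable
    exec "$@"
}

source_fragments() {
    local IFS=':' frag
    for frag in $STARTER_FRAGMENTS; do
        [ -r "$frag" ] && . "$frag"  # fragment may override start_job etc.
    done
}

# --- demonstration with a throwaway fragment ---
frag=$(mktemp)
cat > "$frag" <<'EOF'
start_job() { echo "custom launcher would run: $*"; }
EOF
STARTER_FRAGMENTS=$frag
source_fragments
start_job /bin/true
rm -f "$frag"
```

Because the fragment list travels in the job's environment, a submitted job (or a JSV acting on it) can change launch behaviour without touching anything installed on the execution hosts.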

This way the job itself controls what is executed at launch, on both the MASTER and the SLAVEs, which means we can develop safely on a production system simply by submitting a job that swaps in a new client-side JSV (-clear -jsv ...), or by setting an environment variable via a qsub flag.


This model has worked very well for years :)

The only thing that's broken it so far is this business of managing tmpdir space. I'm going to have to do something like your method in the epilog if I want to provide an option to copy the SLAVE tmpdirs to permanent storage at the end of the job. Annoyingly, I'd also have to stop relying upon the execd to manage tmpdir creation/deletion, as otherwise they're too ephemeral.
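For what it's worth, the epilog step I have in mind might look something like the sketch below. `$PE_HOSTFILE` and `$TMPDIR` are the standard SGE variables; `SAVE_DIR`, the `ssh`+`tar` transport, and the `RSH` override hook are assumptions for illustration only.

```shell
#!/bin/bash
# Hedged sketch of an epilog step that copies each SLAVE node's job
# tmpdir back to permanent storage.  PE_HOSTFILE and TMPDIR are the
# usual SGE variables; SAVE_DIR and the ssh+tar transport are assumed.
# RSH defaults to ssh but can be overridden (e.g. for local testing).

save_slave_tmpdirs() {
    local host rest
    while read -r host rest; do
        mkdir -p "$SAVE_DIR/$host"
        # Stream the remote tmpdir contents into per-host storage.
        ${RSH:-ssh} "$host" tar -C "$TMPDIR" -cf - . \
            | tar -C "$SAVE_DIR/$host" -xf -
    done < "$PE_HOSTFILE"
}
```

The awkward part, as noted above, is that this only works if the tmpdirs still exist when the epilog runs, which is why execd-managed creation/deletion gets in the way.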


...
With the magic option programs permissions are left alone and
jobs only access the gpu we intend for them.  Given that this
is an option to a kernel module I assume that it is responsible
for the reset of permissions.
...

Although the magic kernel-module option prevents it from happening, the strace output I looked at implies that it's actually the user process that resets the permissions. I want to be wrong.

Mark
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users