On Tue, 14 Feb 2017, William Hay wrote:
...
Our prolog does a parallel ssh (passing through appropriate env vars)
into every node assigned to the job and does the equivalent of a
run-parts on a directory filled with scripts. Some of these scripts
check whether they are running on the head node.
(been meaning to reply to this bit for a while, sorry)
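To sketch the shape of that approach as I understand it (every name
below is hypothetical; the real prolog presumably differs):

    #!/bin/bash
    # Hypothetical prolog sketch: run every executable in a directory
    # on each node of the job, in parallel, forwarding selected env
    # vars through the ssh.
    SCRIPTS_DIR=/opt/site/prolog.d   # hypothetical path

    # First column of $PE_HOSTFILE is the hostname of each node.
    while read -r host _; do
        ssh "$host" "export JOB_ID='$JOB_ID' SGE_O_WORKDIR='$SGE_O_WORKDIR';
            for s in $SCRIPTS_DIR/*; do [ -x \"\$s\" ] && \"\$s\"; done" &
    done < "$PE_HOSTFILE"
    wait   # all per-node runs must finish before the job starts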
For comparison purposes, we achieved a similar result with an extensible
starter_method combined with a client-side JSV.
The core starter_method is written in bash and is very basic, but almost
everything can be overridden or supplemented (including the bit that
actually starts the job script). It does this by reading an environment
variable containing a list of shell fragments to source.
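A stripped-down illustration of the idea (the hook variable name and
mechanism here are a sketch of my own, not a verbatim copy of our
production script):

    #!/bin/bash
    # Minimal extensible starter_method sketch. Everything lives in
    # functions, so a sourced fragment can redefine any step -
    # including the one that actually starts the job script.

    start_job() { exec "$@"; }   # default launch; fragments may override

    # SGE_STARTER_HOOKS (hypothetical name): colon-separated list of
    # shell fragments to source, passed through from submission.
    IFS=':' read -ra hooks <<< "${SGE_STARTER_HOOKS:-}"
    for hook in "${hooks[@]}"; do
        [ -r "$hook" ] && . "$hook"
    done

    start_job "$@"   # execd invokes us with the job script and its args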
This way the job controls what is executed at launch, both on the MASTER
and the SLAVEs, meaning we can easily develop on a production system
simply by submitting a job that swaps in a new client-side JSV (-clear
-jsv ...), or by setting an environment variable with a qsub flag.
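For example, a test run against the live cluster can look something
like this (paths hypothetical):

    # One-off job using a replacement client-side JSV:
    qsub -clear -jsv ~/dev/jsv_test.sh my_job.sh

    # Or feed the starter an extra fragment via the environment:
    qsub -v SGE_STARTER_HOOKS=$HOME/dev/new_launcher.sh my_job.sh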
This model has worked very well for years :)
The only thing that's broken it so far is this business of managing
tmpdir space. I'm going to have to do something like your method in the
epilog if I want to provide an option to copy the SLAVE tmpdirs to
permanent storage at the end of the job. Annoyingly, I'd also have to
stop relying upon the execd to manage tmpdir creation/deletion, as
otherwise the directories are gone before the epilog can get at them.
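The epilog I have in mind would be roughly the following (names
hypothetical; it assumes the tmpdirs are created by our own machinery
so they still exist when the epilog runs, and that a host list is
still reachable via $PE_HOSTFILE):

    #!/bin/bash
    # Hypothetical epilog sketch: pull each SLAVE's tmpdir back to
    # permanent storage before our own cleanup removes it.
    DEST="$SGE_O_WORKDIR/tmpdir-save/$JOB_ID"
    mkdir -p "$DEST"
    while read -r host _; do
        rsync -a "$host:/tmp/job-$JOB_ID/" "$DEST/$host/" &
    done < "$PE_HOSTFILE"
    wait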
...
With the magic option the device permissions are left alone and
jobs only access the GPU we intend for them. Given that this
is an option to a kernel module, I assume the module itself is
responsible for the reset of permissions.
...
Although the magic kernel module option prevents it from happening, the
strace output I looked at implies that it's actually the user process
that resets the permissions. I want to be wrong.
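For context, the access control being discussed is just the ownership
and mode of the device files; the kind of prolog-side restriction that
keeps getting undone is roughly this (device selection is site-specific
and $gpu_id is hypothetical):

    # Grant the job owner access to only its assigned GPU's device
    # file, then revoke it again at job end:
    setfacl -m "u:$USER:rw" "/dev/nvidia$gpu_id"   # prolog
    setfacl -x "u:$USER" "/dev/nvidia$gpu_id"      # epilog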
Mark