On 06/07/2018 10:22, Steffen Grunewald wrote:
> On Fri, 2018-07-06 at 07:47:16 +0200, Loris Bennett wrote:
>> Hi Tim,
>>
>> Tim Lin <timty...@gmail.com> writes:
>>> As the title suggests, I’m searching for a way to have tighter control over
>>> which node the batch script gets executed on. In my case it’s very hard to
>>> know which node is best for this until after all the nodes are allocated,
>>> right before the batch job starts. I’ve looked through all the documentation
>>> I can get my hands on, but I haven’t found any mention of any control over
>>> the batch host for admins. Am I missing something?
>>
>> As the documentation of 'sbatch' says:
>>
>>   "When the job allocation is finally granted for the batch script,
>>   Slurm runs a single copy of the batch script on the first node in the
>>   set of allocated nodes."
>>
>> I am not aware of any way of changing this.
>>
>> Perhaps you can explain why you feel it is necessary for you to do this.

> For me, the above reads like the user has an idea of a metric for how to
> select the node for rank-0 (and perhaps the code is sufficiently asymmetric
> to justify such a selection), but no way to tell Slurm about it.
>
> What about making the batch script a wrapper around the real payload: on the
> "outer" first node, take the list of assigned nodes and possibly reorder it,
> then run the payload (via passphrase-less ssh?) on the selected, "new first"
> node?
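
If I follow, something along those lines might look like the sketch below
(untested; the node-selection step and the payload script name are just
placeholders for whatever the user actually has):

  #!/bin/bash
  #SBATCH -N 4

  # Slurm starts this wrapper on the first node of the allocation.
  # Expand the allocation into individual hostnames.
  nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")

  # Placeholder: pick the preferred "first" node by whatever metric applies;
  # here we simply take the last host in the list.
  master=$(echo "$nodes" | tail -n 1)

  if [ "$master" = "$(hostname -s)" ]; then
      ./real_payload.sh
  else
      # Relies on passphrase-less ssh between the allocated nodes.
      ssh "$master" "cd '$SLURM_SUBMIT_DIR' && ./real_payload.sh"
  fi
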
Why not just use salloc instead? Allocate all the nodes for the job,
then use the script to select (ssh?) the master and start the actual job
there.
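
Roughly something like this (untested; pick_master stands in for whatever
selection logic is needed, and real_payload.sh for the actual job):

  salloc -N 4 bash -c '
      nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
      master=$(pick_master $nodes)
      # ssh would work too, but srun keeps the step under Slurm control,
      # so signalling and accounting should still behave:
      srun -N1 -n1 --nodelist="$master" ./real_payload.sh
  '
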
I'm still not sure why that would be necessary, though. Could you give a
clear example of the master selection process? What metric/constraint is
involved, and why can it only be obtained after node selection?

> This may require changing some more environment variables, and may harm
> signalling.
>
> Okay, my suggestion reads like a terrible kludge (which it certainly is), but
> AFAIK there's no way to tell Slurm about "preferred first nodes".
>
> - S