Re: [slurm-users] Elastic Compute on Cloud - Error Handling

Lachlan Musicman Sat, 28 Jul 2018 19:02:20 -0700

On 29 July 2018 at 04:32, Felix Wolfheimer <f.wolfhei...@googlemail.com>
wrote:


> I'm experimenting with SLURM Elastic Compute on a cloud platform. I'm
> facing the following situation: Let's say, SLURM requests that a compute
> instance is started. The ResumeProgram tries to create the instance, but
> doesn't succeed because the cloud provider can't provide the instance type
> at this point in time (happens for example if a GPU instance is
> requested, but the datacenter simply doesn't have the capacity to provide
> this instance).
> SLURM will mark the instance as "DOWN" and will not try to request it
> again. For this scenario this behavior is not optimal. Instead of marking
> the node DOWN and never requesting it again, I'd like slurmctld to just
> forget about the failure and try again to start the node after some
> time. Is there any knob which can be used to achieve this behavior?
> Optimally, the behavior might be triggered by the return code of the
> ResumeProgram, e.g.,
>
> return code=0 - Node is starting up
> return code=1 - A permanent error has occurred, don't try again
> return code=2 - A temporary failure has occurred. Try again later.
>
>
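
As a rough illustration of the return-code idea quoted above, here is a
minimal Python sketch of how a ResumeProgram could classify its outcome.
Note the assumptions: the 0/1/2 convention is only the proposal from the
quoted mail, not something slurmctld is known to interpret, and
CapacityError, ProvisioningError, and create_instance are hypothetical
stand-ins for whatever the cloud provider's SDK actually raises and calls.

```python
from enum import IntEnum


class ResumeResult(IntEnum):
    """Exit codes as proposed in the quoted mail (not a Slurm convention)."""
    STARTING = 0           # node is starting up
    PERMANENT_FAILURE = 1  # do not try again
    TEMPORARY_FAILURE = 2  # try again later


class CapacityError(Exception):
    """Hypothetical: the provider has no capacity for this type right now."""


class ProvisioningError(Exception):
    """Hypothetical: the request cannot succeed (bad image, quota, ...)."""


def resume_node(node_name, create_instance):
    """Try to create one instance and map the outcome to a proposed code.

    create_instance is a hypothetical callable that takes the node name
    and raises one of the exceptions above on failure.
    """
    try:
        create_instance(node_name)
    except CapacityError:
        return ResumeResult.TEMPORARY_FAILURE
    except ProvisioningError:
        return ResumeResult.PERMANENT_FAILURE
    return ResumeResult.STARTING
```

A real script would pass the result to sys.exit(); whether anything on
the Slurm side acts on it is exactly the open question of this thread.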

I don't have an answer to your question, but I would like to know how you
inject the hostname and/or IP address into slurm.conf and then distribute
it in this situation.

I have read the documentation but, IIRC, it doesn't indicate a best
practice for this scenario.

Is it as simple as these steps: wait for boot, grab the hostname, inject it
into slurm.conf, distribute slurm.conf to the nodes, and restart Slurm?
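
For comparison, one possible alternative to rewriting slurm.conf, assuming
the node is already defined there (for example with State=CLOUD), is to
report the address at runtime with "scontrol update". The sketch below only
builds that command; the helper name and the sample values are illustrative
assumptions, not anything from this thread.

```python
def build_node_update_command(node_name, ip_address, hostname):
    """Build the scontrol call that tells slurmctld where a node lives.

    Returns an argument list suitable for subprocess.run(); nothing is
    executed here, so the sketch is safe to try without a Slurm install.
    """
    return [
        "scontrol",
        "update",
        f"NodeName={node_name}",
        f"NodeAddr={ip_address}",
        f"NodeHostname={hostname}",
    ]
```

A ResumeProgram could run this once the instance has an address, e.g.
subprocess.run(build_node_update_command("ec0", "10.0.0.5", "ip-10-0-0-5"),
check=True), with those three values being made-up examples.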

Cheers
L.
