On 29 July 2018 at 04:32, Felix Wolfheimer <f.wolfhei...@googlemail.com> wrote:
> I'm experimenting with SLURM Elastic Compute on a cloud platform. I'm
> facing the following situation: Let's say, SLURM requests that a compute
> instance is started. The ResumeProgram tries to create the instance, but
> doesn't succeed because the cloud provider can't provide the instance type
> at this point in time (happens for example if a GPU instance is
> requested, but the datacenter simply doesn't have the capacity to provide
> this instance).
> SLURM will mark the instance as "DOWN" and will not try again to request
> it. For this scenario this behavior is not optimal. Instead of marking the
> node DOWN and not trying to request it again after some time, I'd like that
> slurmctld just forgets about the failure and tries again to start the
> node. Is there any knob which can be used to achieve this behavior?
> Optimally, the behavior might be triggered by the return code of the
> ResumeProgram, e.g.,
>
> return code=0 - Node is starting up
> return code=1 - A permanent error has occurred, don't try again
> return code=2 - A temporary failure has occurred. Try again later.
>

I don't have an answer to your question - but I would like to know how you manage injecting the hostname and/or IP address into slurm.conf and then distribute it in this situation?

I have read the documentation, but it doesn't indicate a best practice in this scenario iirc.

Is it as simple as doing those steps - wait for boot, grab hostname, inject into slurm.conf, distribute slurm.conf to nodes, restart slurm?

Cheers
L.
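
For reference, here is a minimal Python sketch of the exit-code convention proposed in the quoted message. It only illustrates the idea: nothing in this thread confirms that slurmctld acts on these codes, and `CapacityError` and the `create_instance` callable are hypothetical stand-ins for whatever the cloud provider's SDK actually offers. The second function builds a `scontrol update` command that sets `NodeAddr` and `NodeHostname` for a node at runtime, which may be an alternative to rewriting and redistributing slurm.conf. Whether it fits a given setup is an assumption to check against the Slurm documentation.

```python
from enum import IntEnum


class ResumeResult(IntEnum):
    """Exit codes as proposed in the quoted message (not a documented Slurm contract)."""
    STARTING = 0           # node is starting up
    PERMANENT_ERROR = 1    # permanent error, don't try again
    TEMPORARY_FAILURE = 2  # temporary failure, try again later


class CapacityError(Exception):
    """Hypothetical error: the provider cannot supply this instance type right now."""


def resume_node(node, create_instance):
    """Try to start `node` and map the outcome onto the proposed exit codes.

    `create_instance` is a hypothetical callable wrapping the provider's API.
    """
    try:
        create_instance(node)
    except CapacityError:
        # e.g. no GPU capacity in the datacenter at the moment
        return ResumeResult.TEMPORARY_FAILURE
    except Exception:
        # anything else is treated as permanent in this sketch
        return ResumeResult.PERMANENT_ERROR
    return ResumeResult.STARTING


def address_update_command(node, ip, hostname):
    """Build the scontrol command that tells slurmctld where the node lives."""
    return [
        "scontrol", "update",
        f"NodeName={node}",
        f"NodeAddr={ip}",
        f"NodeHostname={hostname}",
    ]
```

A real ResumeProgram would pass the value from `resume_node` to `sys.exit()` and run the list from `address_update_command` with `subprocess.run(..., check=True)` once the instance reports its address.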