Hey Steve,

I think the suspend script doesn't just "power down" the nodes but deletes the instances. So when you need a new node, it creates a new instance, provisions its config, and then updates the Slurm cluster config...
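If instances really are deleted rather than powered down, Slurm's view and the cloud's view of a node can drift apart, which may be why a periodic sync step exists. Here is a rough, untested Python sketch of that idea. It is not taken from slurm-gcp-sync.py: `plan_sync`, its arguments, and the action names are all hypothetical, and the two rules are just my assumption of what a reconciliation pass could look like.

```python
def plan_sync(slurm_nodes, cloud_instances):
    """Compare Slurm's view of nodes with the cloud's and list corrective actions.

    slurm_nodes: dict mapping node name -> True if Slurm expects the node
                 to be powered up, False if it is in power saving mode.
    cloud_instances: set of instance names that currently exist in the cloud.

    Both inputs and the action names are hypothetical, for illustration only.
    """
    actions = []
    for name, expected_up in sorted(slurm_nodes.items()):
        exists = name in cloud_instances
        if expected_up and not exists:
            # Slurm thinks the node is up but the instance is gone
            # (e.g. deleted outside Slurm), so flag the node as DOWN.
            actions.append(("mark_down", name))
        elif not expected_up and exists:
            # Slurm thinks the node is suspended but the instance still
            # exists, so remove the leftover instance.
            actions.append(("delete_instance", name))
        # Otherwise both sides agree and nothing needs to be done.
    return actions


if __name__ == "__main__":
    slurm_view = {"c1": True, "c2": False, "c3": True, "c4": False}
    cloud_view = {"c2", "c3"}
    for action, node in plan_sync(slurm_view, cloud_view):
        print(action, node)
```

In a real script the "mark_down" case would presumably map to something like `scontrol update nodename=<node> state=down reason=<text>`, and the delete case to a Compute Engine API call, but I haven't checked how the SchedMD script actually does either.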
That's how I understand it, but I haven't tried running it myself.

Regards,
Alex

On Thu, Dec 12, 2019 at 1:20 AM Steve Brasier <ste...@stackhpc.com> wrote:

> Hi, I'm hoping someone can shed some light on the SchedMD-provided example
> here https://github.com/SchedMD/slurm-gcp for an autoscaling cluster on
> Google Cloud Platform (GCP).
>
> I understand that slurm autoscaling uses the power saving interface to
> create/remove nodes and the example suspend.py and resume.py scripts in the
> seem pretty clear and in line with the slurm docs. However I don't
> understand why the additional slurm-gcp-sync.py script is required. It
> seems to compare the states of nodes as seen by google compute and slurm
> and then on the GCP side either start instances or shut them down, and on
> the slurm side mark them as in RESUME or DOWN states. I don't see why this
> is necessary though; my understanding from the slurm docs is that e.g. the
> suspend script simply has to "power down" the nodes, and slurmctld will
> then mark them as in power saving mode - marking nodes down would seem to
> prevent jobs being scheduled on them, which isn't what we want. Similarly,
> I would have thought the resume.py script could mark nodes as in RESUME
> state itself, (once it's tested that the node is up and slurmd is running
> etc).
>
> thanks for any help
> Steve