Is anyone on the list using maintenance partitions for broken nodes? If so, how are you moving nodes between partitions?
The situation with my machines at the moment is that we have a steady stream of new jobs coming into the queues, but a steady stream of broken nodes as well. I'd like to fix those broken nodes and re-add them to a separate non-production pool, so that user jobs don't match them but I can still run maintenance jobs on them to prove things are working before giving them back to the users.

If I simply mark nodes with DownNodes= or scontrol update state=drain, Slurm will prevent users from starting new jobs on them, but it won't let me run my test jobs on them either.

Ideally, I'd like to have a prod partition and a maint partition, where the maint partition is set to ExclusiveUser, and where I can set a node's state to drain in the prod partition without affecting its state in the maint partition. I don't believe I can do this, though, since node state appears to be global rather than per-partition. As far as I can tell, I have to edit slurm.conf and reconfigure to move nodes from one partition to the other.

If anyone has a better solution, I'd like to hear it.
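For reference, the drain step I'm describing is just the usual (node name is only an example):

    scontrol update NodeName=node01 State=DRAIN Reason="repair/testing"

which keeps user jobs off the node, but keeps my verification jobs off it too.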
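And the workflow I'm stuck with now looks roughly like this in slurm.conf (node ranges are made up; moving a node means editing both Nodes= lists):

    PartitionName=prod  Nodes=node[01-14] Default=YES State=UP
    PartitionName=maint Nodes=node[15-16] ExclusiveUser=YES State=UP

followed by an scontrol reconfigure (or a slurmctld restart, depending on version) every time a node changes hands, which is what I'm hoping to avoid.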