On Mon, 06 Mar 2023 13:35:38 +0100, Stefan Staeglich <staeg...@informatik.uni-freiburg.de> wrote:
> But this did not fix the main error; it might only have reduced the
> frequency of its occurrence. Has someone observed similar issues? We will
> try a higher SuspendTimeout.

We had issues with power saving, too. We powered idle nodes off, so resuming
meant a full boot. We repeatedly observed the strange behaviour that a node
would be up for a while, but slurmctld would only detect it as ready right
when it was about to give up at the timeout.

But instead of chasing this possibly subtle logic error, we figured that

a) the node suspend support in Slurm was not really designed for a full
   power off/on cycle, which regularly takes minutes, and

b) taking nodes out of/into production is something the cluster admin does;
   it is not really in the scope of the batch system.

Hence I wrote a script that runs as a service on a superior admin node. It
queries Slurm for idle nodes and pending jobs and then decides which nodes to
drain and power down, and which to bring back online (a rough sketch is in
the PS below). This needs more knowledge of Slurm job and node states than
I'd like, but it works.

Ideally, I'd like the power saving feature of Slurm to consist of a simple
interface that communicates

1. which nodes are probably not needed in the coming x minutes/hours,
   depending on the job queue, with settings like keeping a minimum number
   of nodes idle, and

2. which of the currently drained/offline nodes it could use to satisfy
   user demand.

I imagine that Slurm upstream is not very keen on hashing out a robust
interface for that. I can see arguments for keeping this wholly internal to
Slurm, but for me, taking nodes in/out of production is not directly a batch
system's task.

Obviously, integrating power saving in a way that really powers nodes down
brings complications like the strange ResumeTimeout behaviour. Also, in the
case of nodes that have trouble getting back online, the method inside Slurm
makes for a bad user experience: the nodes are first allocated to the job,
and _then_ they are powered up. In the worst case of a defective node, Slurm
will wait for the whole ResumeTimeout just to realize that it doesn't really
have the resources it just promised to the job, making the run attempt fail
needlessly.

With my external approach, bringing a node back up is handled outside
slurmctld. Only after a node is back is it undrained, and only then are jobs
allocated on it. I drain nodes with a specific reason string to mark those
that are offline due to power saving. What sucks is that I have to implement
part of the scheduler, in the sense that I need to match pending jobs'
demands against the properties of available nodes.

Maybe the internal power saving could be made more robust, but I would
rather see more separation of concerns than everything being put into one
box. Things are too entangled as it is; even my simple concept of a 'job'
does not begin to describe what Slurm has in terms of various steps as
scheduling entities, which by default also use delayed allocation techniques
(regarding prolog script behaviour, for example).

Alrighty then,

Thomas

--
Dr. Thomas Orgis
HPC @ Universität Hamburg
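PS: To give an idea of the shape of the thing, a stripped-down sketch of such
an external service could look like the following (Python). The
sinfo/squeue/scontrol calls are real Slurm commands, but the power control
helpers, the 'powersave' reason string and the MIN_IDLE setting are
placeholders, and the actual matching of pending jobs' demands against node
properties is left out.

#!/usr/bin/env python3
# Rough sketch of an external power-saving service as described above.
# sinfo/squeue/scontrol are real Slurm commands; the power control
# helpers, the reason string and MIN_IDLE are site-specific placeholders.

import subprocess
import time

MIN_IDLE = 2          # keep at least this many nodes idle (made-up policy)
REASON = "powersave"  # drain reason used to mark our own power-downs


def run(cmd):
    """Run a command and return its stdout, stripped."""
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout.strip()


def node_states():
    """Return {node: (compact_state, reason)} as reported by sinfo."""
    states = {}
    for line in run(["sinfo", "-h", "-N", "-o", "%N %t %E"]).splitlines():
        parts = line.split(None, 2)
        node, state = parts[0], parts[1]
        reason = parts[2] if len(parts) > 2 else ""
        states[node] = (state, reason)
    return states


def pending_jobs():
    """Return the IDs of all pending jobs."""
    return run(["squeue", "-h", "-t", "PD", "-o", "%i"]).splitlines()


def power_off(node):
    # Placeholder: out-of-band power control (IPMI, Redfish, PDU, ...).
    print(f"(would power off {node})")


def power_on(node):
    # Placeholder: out-of-band power control.
    print(f"(would power on {node})")


def wait_until_booted(node):
    # Placeholder: poll until slurmd/ssh on the node answers, with a timeout.
    time.sleep(5)


def park(node):
    """Drain a node with our reason, then power it off."""
    run(["scontrol", "update", f"NodeName={node}",
         "State=DRAIN", f"Reason={REASON}"])
    power_off(node)


def unpark(node):
    """Power a node on and undrain it only once it is really back."""
    power_on(node)
    wait_until_booted(node)
    run(["scontrol", "update", f"NodeName={node}", "State=RESUME"])


def cycle():
    states = node_states()
    idle = [n for n, (s, _) in states.items() if s.startswith("idle")]
    parked = [n for n, (s, r) in states.items()
              if s.startswith("drain") and r.startswith(REASON)]

    if pending_jobs():
        # Demand exists: bring a parked node back into production.
        # (The real script also matches job demands against node properties.)
        if parked:
            unpark(parked[0])
    elif len(idle) > MIN_IDLE:
        # Surplus idle nodes and an empty queue: park one of them.
        park(idle[-1])


if __name__ == "__main__":
    while True:
        cycle()
        time.sleep(60)

The important bit is the ordering in unpark(): the node is only handed back
to slurmctld via State=RESUME after it has demonstrably booted, so jobs are
never allocated to hardware that is not actually there yet.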