On Wed, 29 Mar 2023 15:51:51 +0200, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:
> As for job scheduling, slurmctld may allocate a job to some powered-off
> nodes and then calls the ResumeProgram defined in slurm.conf. From this
> point it may indeed take 2-3 minutes before a node is up and running
> slurmd, during which time it will have a state of POWERING_UP (see "man
> sinfo"). If this doesn't complete after ResumeTimeout seconds, the node
> will go into a failed state. All this logic seems to be working well.

I must admit that I struggle to find a definite example in the slurmctld logs of where the ResumeTimeout was just broken. I remember that the node was recognized suspiciously right after the timeout passed. But what we definitely had was the issue that a node would _not_ get back up again for some reason. Slurmctld can only wait for the timeout and then tell the job that it is screwed. With my external scheme, the job can be scheduled on any number of other nodes that are becoming available.
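(As an aside, for anyone reading along without the man page at hand: the machinery Ole describes hangs off a handful of slurm.conf parameters, roughly like this. This is only a sketch, not our actual configuration; the program paths and the excluded node are placeholders, and only the ResumeTimeout value mirrors the ResumeTimeout(900) showing up in the logs further down.)

SuspendProgram=/usr/local/sbin/node_powerdown   # placeholder path
ResumeProgram=/usr/local/sbin/node_powerup      # placeholder path
SuspendTime=600        # seconds of idle before a node gets powered down
SuspendTimeout=120     # seconds allowed for the power-down to complete
ResumeTimeout=900      # seconds allowed until slurmd answers again
SuspendExcNodes=node001   # placeholder: nodes never to be powered down

Once ResumeTimeout is exceeded, slurmctld marks the node down, which is exactly the "not resumed by ResumeTimeout(900) - marking down and power_save" message below.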
> I think we had power_save with IPMI working also in Slurm 21.08 before we
> upgraded to 22.05.

Good to read … so … either I had some special situation or misread the logs after all. I'm trying to dig again now to see a case. I got a bunch of them early on …

[2022-10-03T23:22:46.664] node node329 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.664] node node330 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.664] node node331 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.665] node node332 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.665] node node333 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.665] node node334 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.665] node node335 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.665] node node336 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.666] node node337 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.666] node node338 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.666] node node339 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.666] node node340 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.666] node node341 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.667] node node342 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.667] node node343 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.667] node node344 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.667] node node345 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.669] node node346 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.670] node node347 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.670] node node348 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.670] node node349 not resumed by ResumeTimeout(900) - marking down and power_save
[2022-10-03T23:22:46.676] Node node347 now responding
[2022-10-03T23:22:46.676] Node node345 now responding
[2022-10-03T23:22:46.676] Node node346 now responding
[2022-10-03T23:22:46.676] Node node349 now responding
[2022-10-03T23:22:46.676] Node node348 now responding
[2022-10-03T23:22:46.676] Node node329 now responding
[2022-10-03T23:22:46.693] Node node335 now responding
[2022-10-03T23:22:46.712] Node node332 now responding
[2022-10-03T23:22:46.712] Node node330 now responding
[2022-10-03T23:22:46.712] Node node333 now responding
[2022-10-03T23:22:46.713] Node node334 now responding
[2022-10-03T23:22:46.713] Node node338 now responding
[2022-10-03T23:22:46.713] Node node339 now responding
[2022-10-03T23:22:46.713] Node node336 now responding
[2022-10-03T23:22:46.713] Node node337 now responding
[2022-10-03T23:22:46.713] Node node340 now responding
[2022-10-03T23:22:46.713] Node node341 now responding
[2022-10-03T23:22:46.713] Node node343 now responding
[2022-10-03T23:22:46.713] Node node331 now responding
[2022-10-03T23:22:46.713] Node node344 now responding

So, quite a coincidence of node333 and friends timing out after 15 minutes and then responding within one second.

I also got a more isolated occurrence:

[2022-10-06T18:02:02.464] node node327 not resumed by ResumeTimeout(300) - marking down and power_save
[2022-10-06T18:02:47.441] Node node327 now responding

Or, later, probably with the then-final version of the power saving triggered by slurmctld, with more log lines around the event:

[2022-11-18T18:37:56.437] node node355 not resumed by ResumeTimeout(300) - marking down and power_save
[2022-11-18T18:37:56.437] requeue job JobId=2522959_9(2523030) due to failure of node node355
[2022-11-18T18:37:56.437] Requeuing JobId=2522959_9(2523030)
[2022-11-18T18:38:07.290] Node node355 now responding
[2022-11-18T18:38:07.290] node node355 returned to service
[2022-11-18T18:39:08.125] error: Nodes node355 not responding
[2022-11-18T18:39:52.551] _job_complete: JobId=2522959_24(2523045) WEXITSTATUS 0
[2022-11-18T18:39:52.551] _job_complete: JobId=2522959_24(2523045) done
[2022-11-18T18:40:10.834] sched/backfill: _start_job: Started JobId=2522959_9(2523030) in gpd on node355
[…]
[2022-11-18T19:00:02.916] _job_complete: JobId=2522959_9(2523030) WEXITSTATUS 0
[2022-11-18T19:00:02.916] _job_complete: JobId=2522959_9(2523030) done

Here, the job got requeued, so this is maybe mainly unnecessary noise. I see that the failure to start the job occurred at 18:37:56.437, while the node came back 10 seconds later. Then it went out a minute later, came back … maybe network weirdness.

Ah, digging in my own messages to the users, I see that we had two distinct issues:

1. Jobs being aborted with NODE_FAIL because: “batch system asks to resume nodes that aren't actually suspended. This triggered a workaround [for] issues during powerdown, doing a power cycle on them.”

2. Requeued jobs like the above.

So, point 1 was a bad interaction of the slurmctld resume script with my reboot mechanism. For $reasons, nodes could get stuck on shutdown (boot, actually), and my resume script cycled the power to ensure a fresh start if a node was powered on. I then modified it to only do that if there is no live instance of slurmd on the node. I never figured out why a resume request got issued for nodes that are not down, but it seems to be a thing that supposedly can happen.
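To make that concrete, here is a rough sketch of that guard, not my actual script: on a resume request, leave the node alone if a live slurmd answers, power-cycle it if it is powered on without a reachable slurmd, and just power it on otherwise. The BMC naming, the credentials file and the use of the default slurmd port are assumptions for the example.

#!/usr/bin/env python3
"""Sketch of a guarded resume/power-cycle helper (not the real script).

Assumptions for the example: the BMC of node "nodeXYZ" is reachable as
"nodeXYZ-bmc", the IPMI password sits in /etc/ipmi-password, and slurmd
listens on the default port 6818.
"""

import socket
import subprocess
import sys

SLURMD_PORT = 6818  # Slurm's default SlurmdPort


def slurmd_alive(node: str, timeout: float = 3.0) -> bool:
    """True if something accepts a TCP connection on the node's slurmd port."""
    try:
        with socket.create_connection((node, SLURMD_PORT), timeout=timeout):
            return True
    except OSError:
        return False


def ipmi(node: str, *args: str) -> str:
    """Run ipmitool against the node's BMC and return its output."""
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", f"{node}-bmc",
         "-U", "admin", "-f", "/etc/ipmi-password", *args],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def resume(node: str) -> None:
    if slurmd_alive(node):
        # The guard: the node is not actually suspended, so a power cycle
        # would kill whatever is running there. Leave it alone.
        return
    if "Power is on" in ipmi(node, "power", "status"):
        # Powered on but without a live slurmd (e.g. stuck somewhere between
        # shutdown and boot): cycle the power to get a fresh start.
        ipmi(node, "power", "cycle")
    else:
        ipmi(node, "power", "on")


if __name__ == "__main__":
    # A real ResumeProgram gets a hostlist expression and would expand it
    # first (e.g. via "scontrol show hostnames"); here we just take plain
    # node names as arguments.
    for name in sys.argv[1:]:
        resume(name)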
Point 2 could be valid behaviour for nodes that just need a bit too long … or spuriously triggered by the slurmd messages for getting back into service running into congestion issues. I did not investigate this further and rather avoided this tight coupling of reboots of possibly flaky hardware (ages 6 and up) with scheduled jobs.

Alrighty then,

Thomas

--
Dr. Thomas Orgis
HPC @ Universität Hamburg