Hi, for about half a year we have been using the suspend/resume support in Slurm. It works quite well, but sometimes it breaks and nodes are no longer suspended or resumed.
In this case we see the following message in the log:

error: power_save module disabled, NULL SuspendProgram

A restart of slurmctld fixes the issue for a few weeks.

In the beginning we also had messages like:

error: power_save: program exit status of 1

So we added error logging to the scripts and made them always terminate with exit code 0. The idea was to prevent Slurm from setting SuspendProgram to NULL. This did not fix the underlying error, but it may have reduced how often it occurs.

Has anyone observed similar issues? We will try a higher SuspendTimeout.

Best,
Stefan

--
Stefan Stäglich, Universität Freiburg, Institut für Informatik
Georges-Köhler-Allee, Geb.52, 79110 Freiburg, Germany
E-Mail : staeg...@informatik.uni-freiburg.de
WWW : ml.informatik.uni-freiburg.de
Telefon: +49 761 203-8223
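
For readers who want to try the same workaround, here is a minimal sketch of a SuspendProgram wrapper that logs failures but always exits with 0. It is not the script used at our site: LOG_FILE, POWER_CMD and the helper names log_msg and run_logged are hypothetical placeholders, and the only assumption about Slurm is that slurmctld passes the affected nodes to the program as a single hostlist expression.

```shell
#!/bin/bash
# Sketch of a SuspendProgram wrapper: log failures, but always exit 0.
# LOG_FILE and POWER_CMD are hypothetical, site-specific placeholders.
LOG_FILE="${LOG_FILE:-/var/log/slurm/power_save_suspend.log}"
POWER_CMD="${POWER_CMD:-/usr/local/sbin/node_poweroff}"

log_msg() {
    # Append a timestamped line; stay silent if the log is not writable.
    { printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$*" >> "$LOG_FILE"; } 2>/dev/null || true
}

run_logged() {
    # Run the given command, record a non-zero status, and return 0 regardless.
    local status=0
    "$@" >> "$LOG_FILE" 2>&1 || status=$?
    if [ "$status" -ne 0 ]; then
        log_msg "ERROR: '$*' exited with status $status"
    fi
    return 0
}

# slurmctld passes the affected nodes as one hostlist expression in $1.
if [ "$#" -gt 0 ]; then
    log_msg "suspend requested for: $1"
    run_logged "$POWER_CMD" "$1"
    exit 0
fi
```

Because the wrapper hides every failure from slurmctld, the log file becomes the only place where a failed power-off shows up, so it needs to be checked regularly.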