On 09-08-2022 01:11, David Magda wrote:
On Aug 6, 2022, at 15:13, Chris Samuel <ch...@csamuel.org> wrote:
It's also safe to restart slurmd with running jobs, though you may want to
drain the nodes first so slurmctld won't try to send them a job mid-restart.
My testing has shown that this is not the case: any jobs that are running are
killed with signal 15 if I do a 'systemctl restart slurmd' or 'service slurmd
restart'. Is there some flag in slurm.conf that lets running jobs continue
uninterrupted across a slurmd restart?
We have never had any issues with restarting slurmd while jobs are
running. AFAIK we don't have to configure anything to obtain this
behavior. We use RPM installation of Slurm, so maybe your /opt/slurm
link is causing problems?
When your jobs get killed as you experienced, what's logged to the node's
slurmd.log and the controller's slurmctld.log?
/Ole
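
For reference, the drain-then-restart sequence Chris suggests could be sketched roughly as below. This is only an illustration, not a tested procedure: the function names, the node name node01, the Reason text and the DRY_RUN switch are all made up for the example. It assumes scontrol is run with admin rights and that the systemctl step runs on the compute node itself. By default it only prints the commands. It does not address why jobs get killed on restart in David's setup.

```shell
#!/bin/sh
# Sketch only: drain a node, restart slurmd, then resume the node.
# DRY_RUN, run and restart_slurmd_drained are hypothetical names for this example.
# With DRY_RUN=1 (the default here) commands are printed instead of executed.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

restart_slurmd_drained() {
    node="$1"
    # Ask slurmctld not to schedule new jobs on the node; running jobs continue.
    run scontrol update NodeName="$node" State=DRAIN Reason="slurmd restart"
    # Restart the daemon (assumed to be executed on the node being restarted).
    run systemctl restart slurmd
    # Put the node back into service.
    run scontrol update NodeName="$node" State=RESUME
}
```

Calling `restart_slurmd_drained node01` prints the three commands in order. Setting DRY_RUN=0 would actually execute them, so check the printed output first.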