Hi Loris - about a second

On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett <loris.benn...@fu-berlin.de> wrote:
> Hi Byron,
>
> byron <lbgpub...@gmail.com> writes:
>
> > Hi
> >
> > We recently upgraded slurm from 19.05.7 to 20.11.9 and now we
> > occasionally (3 times in 2 months) have slurmctld hanging so we get
> > the following message when running sinfo
> >
> > “slurm_load_jobs error: Socket timed out on send/recv operation”
> >
> > It only seems to happen when one of our users runs a job that submits
> > a short-lived job every second for 5 days (up to 90,000 in a day).
> > Although that could be a red herring.
>
> What's your definition of a 'short lived job'?
>
> > There is nothing to be found in the slurmctld log.
> >
> > Can anyone suggest how to even start troubleshooting this? Without
> > anything in the logs I don't know where to start.
> >
> > Thanks
>
> Cheers,
>
> Loris
>
> --
> Dr. Loris Bennett (Herr/Mr)
> ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de