Yep, the question of how he has the job set up is an ongoing conversation, but for now it is staying like this and I have to make do.
Even with all the traffic he is generating, though (at worst one job a second over the course of a day), I would still have thought that slurm was capable of managing that. And it was, until I did the upgrade.

On Fri, Jul 29, 2022 at 7:00 AM Loris Bennett <loris.benn...@fu-berlin.de> wrote:

> Hi Byron,
>
> byron <lbgpub...@gmail.com> writes:
>
> > Hi Loris - about a second
>
> What is the use-case for that?  Are these individual jobs or is it a job
> array?  Either way it sounds to me like a very bad idea.  On our system,
> jobs which can start immediately because resources are available still
> take a few seconds to start running (I'm looking at the values for
> 'submit' and 'start' from 'sacct').  If a one-second job has to wait for
> just a minute, the ratio of wait-time to run-time is already
> disproportionately large.
>
> Why doesn't the user bundle these individual jobs together?  Depending
> on your maximum run-time and to what degree jobs can make use of
> backfill, I would tell the user something between a single job and
> maybe 100 jobs.  I certainly wouldn't allow one-second jobs in any
> significant numbers on our system.
>
> I think having a job starting every second is causing your slurmdbd to
> time out and that is the error you are seeing.
>
> Regards
>
> Loris
>
> > On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett <loris.benn...@fu-berlin.de> wrote:
> >
> > Hi Byron,
> >
> > byron <lbgpub...@gmail.com> writes:
> >
> > > Hi
> > >
> > > We recently upgraded slurm from 19.05.7 to 20.11.9 and now we
> > > occasionally (3 times in 2 months) have slurmctld hanging so we get
> > > the following message when running sinfo
> > >
> > > “slurm_load_jobs error: Socket timed out on send/recv operation”
> > >
> > > It only seems to happen when one of our users runs a job that submits
> > > a short lived job every second for 5 days (up to 90,000 in a day).
> > > Although that could be a red herring.
> >
> > What's your definition of a 'short lived job'?
> >
> > > There is nothing to be found in the slurmctld log.
> > >
> > > Can anyone suggest how to even start troubleshooting this?  Without
> > > anything in the logs I don't know where to start.
> > >
> > > Thanks
> >
> > Cheers,
> >
> > Loris
> >
> > --
> > Dr. Loris Bennett (Herr/Mr)
> > ZEDAT, Freie Universität Berlin         Email loris.benn...@fu-berlin.de
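
For reference, the bundling Loris suggests in the quoted message could look roughly like the following. This is only a minimal Python sketch under stated assumptions: the `bundle` helper, the `./short_task` command and the choice of 100 bundles are hypothetical illustrations, not the user's actual workflow, and the sketch only groups the work; it does not submit anything to slurm.

```python
from typing import List, Sequence


def bundle(tasks: Sequence[str], n_bundles: int) -> List[List[str]]:
    """Split tasks into at most n_bundles contiguous, near-equal groups."""
    if n_bundles < 1:
        raise ValueError("n_bundles must be at least 1")
    # Never create more bundles than there are tasks (but always at least one).
    n_bundles = min(n_bundles, len(tasks)) or 1
    size, extra = divmod(len(tasks), n_bundles)
    bundles = []
    start = 0
    for i in range(n_bundles):
        # The first `extra` bundles take one additional task each.
        end = start + size + (1 if i < extra else 0)
        bundles.append(list(tasks[start:end]))
        start = end
    return bundles


# Hypothetical workload: 90,000 one-second tasks, the daily peak from the thread.
tasks = [f"./short_task {i}" for i in range(90_000)]
# 100 is the upper end of the range Loris mentions ("a single job and maybe 100 jobs").
bundles = bundle(tasks, 100)
```

With these numbers each bundle holds 900 tasks, so each slurm job would run for about 15 minutes instead of one second, and the scheduler would see 100 submissions a day instead of 90,000.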