Dear Christopher,

I increased UnkillableStepTimeout from 60 to 120 seconds as you suggested, but a few hours later three of my nodes were drained with reason "Kill task failed" again. We're not using cgroups.

There is a bug¹ on SchedMD's tracker where people are trying to understand this error. The discussion there mentions that it may be related to the new X11 code in SLURM 18.08.

Regards,

¹ https://bugs.schedmd.com/show_bug.cgi?id=6307

On Thu, May 16, 2019 at 7:02 PM Christopher Samuel <ch...@csamuel.org> wrote:

> On 5/16/19 1:04 AM, Alan Orth wrote:
>
> > but now we get a handful of nodes drained every day with reason "Kill
> > task failed". In ten years of using SLURM I've never had so many
> > problems as I'm having now. :\
>
> We see "kill task failed" issues but as Marcus says that's not related
> to X11 support, when we see it it's usually because the kernel cannot
> evict dirty pages from cgroups quickly enough (or at all) for Slurm's
> liking. You may want to tweak the default timeout for your
> UnkillableStepTimeout from the default of 60 seconds.
>
> All the best,
> Chris
> --
> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

--
Alan Orth
alan.o...@gmail.com
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch
"In heaven all the interesting people are missing." ―Friedrich Nietzsche