On 5/16/19 1:04 AM, Alan Orth wrote:
> but now we get a handful of nodes drained every day with reason "Kill task failed". In ten years of using SLURM I've never had so many problems as I'm having now. :\
We see "Kill task failed" issues too, but as Marcus says, that's not related to X11 support. When we see it, it's usually because the kernel cannot evict dirty pages from cgroups quickly enough (or at all) for Slurm's liking. You may want to increase UnkillableStepTimeout from its default of 60 seconds.
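As a rough sketch, the change in slurm.conf might look something like the fragment below. The value of 180 is purely an illustrative assumption, not a tested recommendation; the right number depends on how long your nodes actually take to flush those pages.

```conf
# slurm.conf fragment (illustrative sketch only)
# UnkillableStepTimeout: how long, in seconds, Slurm waits for a step's
# processes to exit after they have been sent SIGKILL before it treats
# the step as unkillable. The default is 60.
# 180 is an assumed example value; tune it for your own site.
UnkillableStepTimeout=180
```

You can check the value currently in effect with "scontrol show config | grep -i UnkillableStepTimeout". The slurm.conf file should be kept identical across the cluster, and the daemons need to re-read it (for example via "scontrol reconfigure") before the new value applies.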
All the best,
Chris
-- 
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA