On 18/03/2019 23.07, Eric Rosenberg wrote:
> [2019-03-15T09:48:43.000] update_node: node rn003 reason set to: Kill task failed
This usually happens for me when one of the shared filesystems is overloaded and processes are stuck in uninterruptible sleep (D), and thus unable to terminate. Your reason can be different.

HTH, P

--
Dr. Pawel Dziekonski <pawel.dziekon...@kaust.edu.sa>
KAUST Advanced Computing Core Laboratory
https://www.hpc.kaust.edu.sa
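To check whether uninterruptible sleep is the cause on the affected node, one option is to list processes whose state starts with D. The following is only a minimal sketch, assuming a Linux node with the procps `ps` command; the function names `find_d_state` and `list_stuck_processes` are made up for illustration and are not part of Slurm.

```python
import subprocess


def find_d_state(ps_output):
    """Return (pid, stat, command) tuples for processes in state D.

    Expects the text produced by `ps -eo pid,stat,comm`.
    """
    stuck = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        # Skip the header and any malformed or empty lines.
        if len(parts) < 3 or not parts[0].isdigit():
            continue
        pid, stat, comm = parts
        # The first character of STAT is the process state;
        # 'D' means uninterruptible sleep (usually waiting on I/O).
        if stat.startswith("D"):
            stuck.append((int(pid), stat, comm.strip()))
    return stuck


def list_stuck_processes():
    """Run ps on the local node and return its D-state processes."""
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,comm"],
        capture_output=True,
        text=True,
        check=True,
    )
    return find_d_state(result.stdout)
```

Calling `list_stuck_processes()` on the node that was drained with "Kill task failed" returns a list of stuck processes, if any. An empty list suggests that the cause lies elsewhere, although a process may have left the D state by the time you look.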