Hi all,
At the Slurm User Group I mentioned how your unkillable step script can
tell the kernel to dump information about stuck processes to the kernel
log buffer (visible via dmesg, and hopefully syslog'd somewhere useful
for you).
    echo w > /proc/sysrq-trigger
That's it.. ;-) You probably want to echo something useful to /dev/kmsg
beforehand to say what the job ID was that triggered it too.
The 'echo' will block until the kernel completes the writes, which, if
you've got a lot of stuck processes, may take a few seconds.
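Putting those two steps together, here is a minimal sketch of what the
script could contain. The function name and the KMSG/SYSRQ override
variables are made up for illustration (they just make it possible to try
the function against ordinary files), and I'm assuming SLURM_JOB_ID is
set in the script's environment, so please check that on your version.

```shell
#!/bin/sh
# Sketch only: dump_blocked_tasks, KMSG and SYSRQ are illustrative names.
# Writing to the real files needs root.

dump_blocked_tasks() {
    job_id="$1"
    # Default to the real kernel interfaces; override KMSG/SYSRQ to dry-run.
    kmsg="${KMSG:-/dev/kmsg}"
    sysrq="${SYSRQ:-/proc/sysrq-trigger}"

    # Tag the kernel log first so the dump can be matched to a job.
    echo "unkillable step: job ${job_id}, dumping blocked tasks" > "$kmsg"

    # 'w' asks the kernel to dump tasks in uninterruptible (blocked) state.
    # This write blocks until the kernel has finished logging.
    echo w > "$sysrq"
}
```

The script would then end with something like
dump_blocked_tasks "${SLURM_JOB_ID:-unknown}" (again, assuming that
variable is available to it).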
Hope this is useful!
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA