Hi Doug,
You could try using auditd to catch the source.
Back when we still used LSF, we had an issue with one of our
prolog scripts: it killed jobs whenever another job from the same
user was already on the node. auditd helped us identify the culprit,
which turned out to be our own nodecleaner script ;)
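In case it helps, here is a rough sketch of how the audit records could be filtered once a rule is in place. It assumes a rule such as `auditctl -a always,exit -F arch=b64 -S kill -k job_kill` was loaded on the node beforehand (the key name `job_kill` is made up for this example), and that raw records start with `type=SYSCALL`, which may differ depending on the auditd configuration. The function name `parse_kill_record` is hypothetical as well.

```python
import re

# Assumed rule, loaded as root on the compute node before the next cancellation:
#   auditctl -a always,exit -F arch=b64 -S kill -k job_kill
# "job_kill" is an arbitrary key picked for this sketch.

FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_kill_record(line, key="job_kill"):
    """Return who sent which signal to whom, or None if the line does not match."""
    if not line.startswith("type=SYSCALL"):
        return None
    fields = {name: value.strip('"') for name, value in FIELD_RE.findall(line)}
    if fields.get("key") != key:
        return None
    if not all(name in fields for name in ("a0", "a1", "pid")):
        return None
    # a0 and a1 are the raw kill(pid, sig) arguments, logged in hex.
    target = int(fields["a0"], 16) & 0xFFFFFFFF
    if target >= 0x80000000:  # negative pid: the signal went to a process group
        target -= 0x100000000
    return {
        "target_pid": target,
        "signal": int(fields["a1"], 16),  # 15 is SIGTERM on Linux
        "sender_pid": int(fields["pid"]),
        "sender_comm": fields.get("comm", "?"),
        "sender_exe": fields.get("exe", "?"),
    }
```

One way to use it would be to feed it the lines printed by `ausearch -k job_kill --raw` and keep the records where `signal` is 15 and `target_pid` belongs to the cancelled job; `sender_comm` and `sender_exe` then point at whatever sent the SIGTERM.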
Best
Marc
Looking for advice on identifying the source of a job cancellation.
Preemption is not configured on the partition. We sometimes receive a message
"Job nnn on nodexxx CANCELLED at date/time Signal SIGTERM
caught...". We do not see anything in the node logs or slurmctld logs suggesting
the source of the