Grzegorz, sometimes when a parallel application quits, processes are
left running on the compute nodes. You can usually find these by
running 'pgrep -P 1', which lists processes whose parent is PID 1
(init), and excluding any processes owned by root.
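As a rough sketch of the same check in script form (not a tested
cluster tool): the function below filters 'ps' output for processes
whose parent is PID 1 and whose owner is not root. The function names
and the ignore_users parameter are illustrative. Legitimate non-root
daemons also have PID 1 as their parent, so treat the result as a
list of candidates to inspect, not a kill list.

```python
import subprocess


def find_orphans(ps_output, ignore_users=("root",)):
    """Return (pid, user, command) for processes whose parent is PID 1.

    Expects lines of the form "PID PPID USER COMMAND...".
    """
    orphans = []
    for line in ps_output.strip().splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # skip a header line or anything malformed
        pid, ppid, user, command = fields
        if ppid == "1" and user not in ignore_users:
            orphans.append((int(pid), user, command.strip()))
    return orphans


def list_orphans_on_this_node():
    """Run ps on the local node and return the candidate orphans."""
    # One -o option per column; an empty header ("pid=") suppresses
    # the header line.
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "user=", "-o", "args="],
        capture_output=True,
        text=True,
        check=True,
    )
    return find_orphans(result.stdout)


if __name__ == "__main__":
    for pid, user, command in list_orphans_on_this_node():
        print(pid, user, command)
```

Run it on each compute node after a job finishes, for example through
whatever remote shell you already use to reach the nodes.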
These 'orphan' processes use up memory - so if you are having problems
with applications quittin
John, thank you for your reply.
I checked the system logs and there are no signs of the OOM killer.
What do you mean by cleaning 'orphan' processes? Should I check if
there are any processes left after each job execution? I have always
assumed that when mpirun terminates, everything is cleaned
Have you checked the system logs on the machines where this is running?
Could the processes be using so much memory that the Out Of
Memory (OOM) killer is killing them?
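If it helps, here is a minimal sketch of that log check. It assumes
the usual Linux kernel wording ("invoked oom-killer", "Out of memory:
Killed process"); the exact text varies between kernel versions, and
the function name is illustrative.

```python
import re
import sys

# Phrases the Linux kernel typically logs when the OOM killer acts.
OOM_PATTERN = re.compile(
    r"invoked oom-killer|Out of memory: Kill(?:ed)? process",
    re.IGNORECASE,
)


def oom_lines(log_lines):
    """Return the log lines that look like OOM killer activity."""
    return [line.rstrip("\n") for line in log_lines if OOM_PATTERN.search(line)]


if __name__ == "__main__":
    # Reads log text from standard input and prints matching lines.
    for hit in oom_lines(sys.stdin):
        print(hit)
```

Pipe the output of 'dmesg' or the contents of the system log file into
it on each node; where that log file lives depends on the distribution.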
Also check all nodes for leftover 'orphan' processes which are still
running after a job finishes - these should be killed