Hi,

We are migrating to openmpi on our large (~1000 node) cluster, and plan to use it exclusively on a multi-thousand core infiniband cluster in the near future. We had extensive problems with parallel processes not dying after a job crash, which were largely solved by switching to the slurm resource manager.
While orterun supports slurm, it uses the srun facility only to launch the "orted" daemons, which then start the actual user processes themselves. Since migrating to openmpi, we have noticed occasions where orted did not clean up correctly after a parallel job crash. In most cases the crash was due to an infiniband error. Most worryingly, slurm was not able to clean up the orted either, and it was left running along with the user processes.

At SC07 I was told that there is some talk of using srun to launch both orted and the user processes, or alternatively of using srun only. In our experience, either would solve the cleanup problem. Is Rolf Castain on this list?

Thanks,
Federico

P.S. We use the proctrack/linuxproc slurm process tracking plugin. As noted in the config man page, this plugin may fail to find certain processes, which may explain why slurm could not clean up orted effectively.

man slurm.conf(5), version 1.2.22:

NOTE: "proctrack/linuxproc" and "proctrack/pgid" can fail to identify all processes associated with a job since processes can become a child of the init process (when the parent process terminates) or change their process group. To reliably track all processes, one of the other mechanisms utilizing kernel modifications is preferable.
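
To illustrate the man page note, here is a minimal Python sketch of one of the two escape routes it mentions: a process changing its process group. It is not part of slurm or openmpi, and the function name spawn_and_get_pgids is made up for this example. It starts a short-lived child twice, once normally and once in a new session, and reports the process group IDs. I am not claiming this is what orted does; it only shows why a tracker that follows process groups can lose a process.

```python
import os
import subprocess
import sys


def spawn_and_get_pgids(detach):
    """Start a short-lived child and return (parent_pgid, child_pgid).

    Hypothetical helper for illustration only. When detach is True the
    child calls setsid() before exec, so it becomes the leader of a new
    session and a new process group.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os; print(os.getpgid(0))"],
        stdout=subprocess.PIPE,
        text=True,
        start_new_session=detach,
    )
    out, _ = child.communicate()
    return os.getpgid(0), int(out.strip())


if __name__ == "__main__":
    # Child inherits the launcher's process group.
    print("inherited:", spawn_and_get_pgids(detach=False))
    # Child sits in its own process group, invisible to a pgid-based tracker.
    print("detached: ", spawn_and_get_pgids(detach=True))
```

In the first case both IDs match, so a tracker keyed on the job's process group would find the child. In the second case they differ, and the child would have to be found some other way.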