Hi,

We are migrating to openmpi on our large (~1000 node) cluster, and plan
to use it exclusively on a multi-thousand core infiniband cluster in the
near future. We had extensive problems with parallel processes not dying
after a job crash, which were largely solved by switching to the slurm
resource manager.

While orterun supports slurm, it only uses the srun facility to launch
the "orted" daemons, which then start the actual user process
themselves. In our recent migration to openmpi, we have noticed
occasions where orted did not correctly clean up after a parallel job
crash. In most cases the crash was due to an infiniband error. Most
worryingly, slurm was not able to clean up the orted, and it was left
running along with the user processes.

At SC07 I was told that there is some talk of using srun to launch both
orted and user processes, or alternatively of using srun only. Either would
solve the cleanup problem, in our experience. Is Rolf Castain on this
list?

Thanks,
Federico

P.S.
We use the proctrack/linuxproc slurm process tracking plugin. As noted in
the slurm.conf man page, this plugin may fail to find certain processes,
which may explain why slurm could not clean up orted effectively.

From the slurm.conf(5) man page, version 1.2.22:
NOTE: "proctrack/linuxproc" and "proctrack/pgid" can fail to identify
all processes associated with a job since processes can become a child
of the init process (when the parent process terminates) or change their
process group. To reliably track all processes, one of the other
mechanisms utilizing kernel modifications is preferable.
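
To illustrate the failure mode described in that note, here is a minimal
Python sketch (not taken from slurm or openmpi; the function name
escape_demo is made up for illustration). It assumes a POSIX system. A
short-lived intermediate process forks a grandchild, which starts its own
session and then outlives its parent. Afterwards the grandchild is in a
different process group and has a different parent, so tracking based
only on parent/child links or on the process group could plausibly lose
it. This is not a claim about how orted itself behaves.

```python
import os
import time


def escape_demo():
    """Return (launcher_pgid, middle_pid, orphan_pgid, orphan_ppid)."""
    launcher_pgid = os.getpgrp()
    read_fd, write_fd = os.pipe()

    middle_pid = os.fork()
    if middle_pid == 0:
        # Intermediate process: fork a grandchild, then exit right away.
        try:
            os.close(read_fd)
            me = os.getpid()
            if os.fork() == 0:
                try:
                    # Grandchild: new session, hence a new process group.
                    os.setsid()
                    # Wait (at most 5 s) until the intermediate parent is
                    # gone and we have been re-parented.
                    deadline = time.time() + 5.0
                    while os.getppid() == me and time.time() < deadline:
                        time.sleep(0.01)
                    report = "%d %d" % (os.getpgrp(), os.getppid())
                    os.write(write_fd, report.encode())
                finally:
                    os._exit(0)
        finally:
            os._exit(0)

    # Launcher: reap the intermediate process, then read the report.
    os.close(write_fd)
    os.waitpid(middle_pid, 0)
    data = b""
    while True:
        chunk = os.read(read_fd, 64)
        if not chunk:
            break
        data += chunk
    os.close(read_fd)

    orphan_pgid, orphan_ppid = (int(x) for x in data.split())
    return launcher_pgid, middle_pid, orphan_pgid, orphan_ppid
```

Calling escape_demo() returns the launcher's process group, the pid of
the intermediate process, and the process group and parent pid that the
orphaned grandchild reported. The orphan's new parent is typically init,
although it can be another reaper process on some systems.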
