Hi all - we've started seeing a new error, even though as far as I can tell nothing has changed recently, and I was wondering if anyone has ideas about what might be going on.
The symptom is that we sometimes get an error when starting a new mpi job:

Open MPI tried to bind a new process, but something went wrong.  The
process was killed without launching the target application.  Your job
will now abort.

  Local host:        compute-1-19
  Application name:  XXXXXXXXX
  Error message:     hwloc_set_cpubind returned "Error" for bitmap "0,16"
  Location:          odls_default_module.c:499
--------------------------------------------------------------------------
16 total processes failed to start

This started happening with one particular node, although there's nothing obviously wrong with it. Now it's happening with another, and again nothing is obviously wrong. I haven't quite determined whether it always happens on those nodes, but it doesn't seem to happen much at all on other nodes.

We're running openmpi 1.7.4, which I know isn't the newest, but it's been working fine for months. The kernel is 2.6.32-504.8.1.el6.x86_64 from RHEL6, and the CPUs are

model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

If anyone has any ideas, I'd appreciate it.

thanks,
Noam

-----------------------------------------------------------
Noam Bernstein
Center for Computational Materials Science
Naval Research Laboratory
Code 6390
noam.bernst...@nrl.navy.mil
phone: 202 404 8628
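
P.S. One way to narrow this down on an affected node might be to try the same binding outside of Open MPI. The following is only a minimal sketch, not anything taken from Open MPI or hwloc. It assumes Python 3.3 or later is available on the node (the stock RHEL6 Python is older), and that the bitmap "0,16" in the error message refers to CPU indices as the OS numbers them. The helper names parse_cpu_list and try_bind are made up for illustration.

```python
import os


def parse_cpu_list(text):
    """Parse a list such as "0,16" or "0-3,16" into a set of CPU indices."""
    cpus = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def try_bind(cpus):
    """Try to bind this process to `cpus`, then restore the previous mask.

    Returns (ok, message). Uses the Linux sched_setaffinity call through
    the os module, so it only works on Linux.
    """
    original = os.sched_getaffinity(0)
    try:
        os.sched_setaffinity(0, cpus)
    except OSError as exc:
        return False, "binding to %s failed: %s" % (sorted(cpus), exc)
    finally:
        # Put the process back on the CPUs it was allowed to use before.
        os.sched_setaffinity(0, original)
    return True, "binding to %s succeeded" % sorted(cpus)


if __name__ == "__main__":
    # CPUs this process is currently allowed to run on (cpusets/cgroups
    # or taskset can shrink this set).
    print("allowed CPUs:", sorted(os.sched_getaffinity(0)))
    # "0,16" is copied from the error message above.
    ok, message = try_bind(parse_cpu_list("0,16"))
    print(message)
```

Comparing the output on compute-1-19 with the output on a node that works could show whether the OS itself refuses the binding, or whether the allowed CPU set differs between the nodes.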