Hi all - we’ve started seeing a new error even though, as far as I can tell, 
we haven’t changed anything recently. Does anyone have any ideas as to what 
might be going on?

The symptom is that we sometimes get this error when starting a new MPI job:
Open MPI tried to bind a new process, but something went wrong.  The
process was killed without launching the target application.  Your job
will now abort.

  Local host:        compute-1-19
  Application name:  XXXXXXXXX
  Error message:     hwloc_set_cpubind returned "Error" for bitmap "0,16"
  Location:          odls_default_module.c:499
--------------------------------------------------------------------------
16 total processes failed to start

This started happening with one particular node, although there’s nothing 
obviously wrong with it.  Now it’s happening with another, and again nothing is 
obviously wrong. I haven’t quite determined whether it always happens on those 
nodes, but it doesn’t seem to happen much at all on other nodes.
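
One check that might help narrow this down on an affected node is to see 
whether a plain process there is allowed to bind to the CPUs named in the 
bitmap (0 and 16) at all. The sketch below is not part of Open MPI or hwloc, 
and it only approximates what hwloc_set_cpubind does internally. It assumes 
Linux and Python 3.3 or newer (for os.sched_getaffinity and 
os.sched_setaffinity); the helper name check_cpus is made up for illustration.

```python
import os


def check_cpus(cpus, pid=0):
    """Try binding the process to each CPU on its own.

    Returns a dict mapping CPU id -> True if the bind succeeded,
    False if the kernel rejected it. The original affinity mask
    is restored before returning.
    """
    original = os.sched_getaffinity(pid)
    results = {}
    try:
        for cpu in sorted(cpus):
            try:
                os.sched_setaffinity(pid, {cpu})
                results[cpu] = True
            except OSError:
                # Typically EINVAL: the CPU is offline or outside
                # the set this process is permitted to use.
                results[cpu] = False
    finally:
        os.sched_setaffinity(pid, original)
    return results


if __name__ == "__main__":
    # CPUs currently permitted for this process.
    print("allowed:", sorted(os.sched_getaffinity(0)))
    # CPUs taken from the bitmap in the error message above.
    print("bind test:", check_cpus({0, 16}))
```

Running it on a failing node and on a healthy one, ideally from inside a job 
launched the same way as the MPI job, and comparing the two outputs would show 
whether CPU 0 or 16 is missing from the allowed set on the bad node.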

We’re running Open MPI 1.7.4, which I know isn’t the newest, but it’s been 
working fine for months.  The kernel is 2.6.32-504.8.1.el6.x86_64 from RHEL6, 
and the CPUs are 
    model name  : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

If anyone has any ideas, I’d appreciate it.

                                                                        thanks,
                                                                        Noam

-----------------------------------------------------------
Noam Bernstein
Center for Computational Materials Science
Naval Research Laboratory Code 6390

noam.bernst...@nrl.navy.mil
phone: 202 404 8628

