Hello, Can you build hwloc and run lstopo on these nodes to check that everything looks similar? You have hyperthreading enabled on all nodes, and you're trying to bind processes to entire cores, right? Does 0,16 correspond to two hyperthreads within a single core on these nodes? (lstopo -p should show PU#0 and PU#16 inside the first core). thanks Brice
Le 28/04/2015 15:31, Noam Bernstein a écrit : > Hi all - we’re having a new error, despite the fact that as far as I can tell > we haven’t changed anything recently, and I was wondering if anyone had any > ideas as to what might be going on. > > The symptom is that we sometimes get an error when starting a new mpi job: > Open MPI tried to bind a new process, but something went wrong. The > process was killed without launching the target application. Your job > will now abort. > > Local host: compute-1-19 > Application name: XXXXXXXXX > Error message: hwloc_set_cpubind returned "Error" for bitmap "0,16" > Location: odls_default_module.c:499 > -------------------------------------------------------------------------- > 16 total processes failed to start > > This started happening with one particular node, although there’s nothing > obviously wrong with it. Now it’s happening with another, and again nothing > is obviously wrong. I haven’t quite determined whether it always happens on > those nodes, but it doesn’t seem to happen much at all on other nodes. > > We’re running openmpi 1.7.4, which I know isn’t the newest, but it’s been > working fine for months. The kernel is 2.6.32-504.8.1.el6.x86_64 from RHEL6, > and the CPUs are > model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz > > If anyone has any ideas, I’d appreciate it. > > thanks, > Noam > > ----------------------------------------------------------- > Noam Bernstein > Center for Computational Materials Science > Naval Research Laboratory Code 6390 > > noam.bernst...@nrl.navy.mil > phone: 202 404 8628 > > > > _______________________________________________ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2015/04/26803.php