Try adding --hetero-nodes to the command line and see if that helps resolve the problem. Of course, if all the machines are identical, then it won't help.
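For concreteness, the suggestion above amounts to something like this (the hostfile name and process count are placeholders, not from the thread; `--hetero-nodes` tells mpirun not to assume every node has the same topology as the first one):

```shell
# Hypothetical invocation: re-run the failing 16-task job, but make
# mpirun probe each node's topology instead of reusing node 0's.
mpirun --hetero-nodes -np 16 --hostfile myhosts ./my_app
```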
> On Apr 29, 2015, at 1:43 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>
> On 29/04/2015 22:25, Noam Bernstein wrote:
>>> On Apr 29, 2015, at 4:09 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>
>>> Nothing wrong in that XML. I don't see what could be happening besides a
>>> node rebooting with hyper-threading enabled for random reasons.
>>> Please run "lstopo foo.xml" again on the node next time you get the OMPI
>>> failure (assuming you get a chance to log on the node before it reboots
>>> etc).
>>
>> Thanks. Do you understand why OpenMPI would even try to bind core #16? I'm
>> pretty sure it was a 16 task job on a 16 (physical) core machine - shouldn't
>> it try to bind 0-15 only?
>
> If I am reading your first error correctly:
>
>     hwloc_set_cpubind returned "Error" for bitmap "0,16"
>
> hwloc gave a "bitmap" containing bits 0 and 16 to OMPI, and OMPI just
> tried to bind on these processors.
>
> Two possible reasons:
> * OMPI confused some nodes: one node with more than 16 cores/threads got
>   such a bitmap and OMPI ended up using it for binding on another node
> * hwloc generated this invalid bitmap
>
> Brice
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/04/26816.php