Try adding --hetero-nodes to the command line and see if that helps resolve the
problem. Of course, if all the machines are identical, then it won't help.
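For reference, something like this is what I mean — the application name and hostfile here are just placeholders:

```shell
# Tell mpirun not to assume every node has the same topology
# (./my_app and the hostfile name are placeholders)
mpirun --hetero-nodes -np 16 --hostfile hosts ./my_app
```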


> On Apr 29, 2015, at 1:43 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
> 
> On 29/04/2015 22:25, Noam Bernstein wrote:
>>> On Apr 29, 2015, at 4:09 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>> 
>>> Nothing wrong in that XML. I don't see what could be happening besides a
>>> node rebooting with hyper-threading enabled for random reasons.
>>> Please run "lstopo foo.xml" again on the node next time you get the OMPI
>>> failure (assuming you get a chance to log on the node before it reboots
>>> etc).
>> Thanks.  Do you understand why OpenMPI would even try to bind core #16?  I’m 
>> pretty sure it was a 16 task job on a 16 (physical) core machine - shouldn’t 
>> it try to bind 0-15 only?
>> 
> 
> If I am reading your first error correctly:
> 
> hwloc_set_cpubind returned "Error" for bitmap "0,16"
> 
> hwloc gave a "bitmap" containing bits 0 and 16 to OMPI, and OMPI just
> tried to bind on these processors.
> 
> Two possible reasons:
> * OMPI confused some nodes: one node with more than 16 cores/threads got
> such a bitmap, and OMPI ended up using that bitmap when binding on another node
> * hwloc generated this invalid bitmap
> 
> Brice
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/04/26816.php
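As a quick sanity check next time this happens, something along these lines (using hwloc's standard command-line tools, which should be on the node if lstopo is) would show whether hyper-threading came back on:

```shell
# Dump the topology to XML, as Brice suggested
lstopo foo.xml

# Count physical cores vs. hardware threads (PUs);
# on a 16-core node with HT disabled, both should report 16
hwloc-calc --number-of core all
hwloc-calc --number-of pu all
```

If the PU count comes back as 32, that would explain where a bitmap with bit 16 set came from.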
