On 29/04/2015 22:25, Noam Bernstein wrote:
>> On Apr 29, 2015, at 4:09 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>
>> Nothing wrong in that XML. I don't see what could be happening besides a
>> node rebooting with hyper-threading enabled for random reasons.
>> Please run "lstopo foo.xml" again on the node next time you get the OMPI
>> failure (assuming you get a chance to log on the node before it reboots
>> etc).
> Thanks.  Do you understand why OpenMPI would even try to bind core #16?  I’m 
> pretty sure it was a 16 task job on a 16 (physical) core machine - shouldn’t 
> it try to bind 0-15 only?
>

If I am reading your first error correctly:

hwloc_set_cpubind returned "Error" for bitmap "0,16"

hwloc gave a "bitmap" containing bits 0 and 16 to OMPI, and OMPI just
tried to bind on these processors.
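
As a minimal sketch of how to read that bitmap, assuming the string in
the error message uses a comma-separated list of indexes and ranges
(e.g. "0,16" or "0-3,8"), the Python snippet below expands it and
reports any index that does not exist on a machine with a given number
of processors. The helper names parse_cpu_list and out_of_range are
hypothetical and are not part of hwloc or OMPI; open-ended ranges such
as "0-" are not handled.

```python
def parse_cpu_list(text):
    """Expand a list-style CPU set such as "0,16" or "0-3,8" into a set of ints.

    Assumption: the input is a comma-separated list of single indexes and
    closed ranges. Open-ended ranges are not supported in this sketch.
    """
    indexes = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            indexes.update(range(int(first), int(last) + 1))
        else:
            indexes.add(int(part))
    return indexes


def out_of_range(text, num_pus):
    """Return the sorted indexes in the set that are >= num_pus."""
    return sorted(i for i in parse_cpu_list(text) if i >= num_pus)


# With 16 processors (indexes 0-15), index 16 does not exist.
print(out_of_range("0,16", 16))  # prints [16]
# With 32 processors (e.g. 16 cores with hyper-threading), both bits are valid.
print(out_of_range("0,16", 32))  # prints []
```

So on a node that exposes only processors 0-15, binding to "0,16" would
be expected to fail, while the same bitmap is valid on a node that
exposes 32 processors.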

Two possible reasons:
* OMPI confused some nodes: a node with more than 16 cores/threads
produced this bitmap, and OMPI ended up using it to bind on another node
* hwloc itself generated this invalid bitmap

Brice
