On 29/04/2015 22:25, Noam Bernstein wrote:
>> On Apr 29, 2015, at 4:09 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>
>> Nothing wrong in that XML. I don't see what could be happening besides a
>> node rebooting with hyper-threading enabled for random reasons.
>> Please run "lstopo foo.xml" again on the node next time you get the OMPI
>> failure (assuming you get a chance to log on the node before it reboots
>> etc).
> Thanks. Do you understand why OpenMPI would even try to bind core #16? I’m
> pretty sure it was a 16 task job on a 16 (physical) core machine - shouldn’t
> it try to bind 0-15 only?
>
If I am reading your first error correctly:

    hwloc_set_cpubind returned "Error" for bitmap "0,16"

hwloc gave a "bitmap" containing bits 0 and 16 to OMPI, and OMPI just
tried to bind to these processors.

Two possible reasons:
* OMPI confused some nodes: one node with more than 16 cores/threads got
  such a bitmap, and OMPI ended up using it for binding on another node
* hwloc generated this invalid bitmap

Brice
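
To make the bitmap point concrete, here is a minimal sketch in plain Python. It is not hwloc's or OMPI's code. It assumes the bitmap is written as a comma-separated list of indices or ranges, as in the error message above, and that a node's processing units are numbered contiguously from 0. The helper names `parse_cpu_list` and `invalid_cpus` are made up for illustration.

```python
def parse_cpu_list(text):
    """Parse a list-style CPU set such as "0,16" or "0-3,8" into a set of ints."""
    cpus = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range like "0-3" covers both endpoints.
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def invalid_cpus(requested, num_pus):
    """Return requested indices that do not exist on a node whose
    processing units are numbered 0..num_pus-1 (an assumption of this sketch)."""
    return sorted(c for c in parse_cpu_list(requested) if c >= num_pus)


# Node with 16 processing units (hyper-threading off): index 16 does not exist.
print(invalid_cpus("0,16", 16))  # [16]

# Node with 32 processing units (hyper-threading on): both indices exist.
print(invalid_cpus("0,16", 32))  # []
```

Under these assumptions, "0,16" is a valid binding target only on a node that exposes more than 16 processing units, which is consistent with either a node that came up with hyper-threading enabled or a bitmap computed for one node and applied on another.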