On 29/04/2015 18:55, Noam Bernstein wrote:
>> On Apr 29, 2015, at 12:47 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>
>> Thanks. It's indeed normal that OMPI failed to bind to cpuset 0,16 since
>> 16 doesn't exist at all.
>> Can you run "lstopo foo.xml" on one node where it failed, and send the
>> foo.xml that got generated? Just want to make sure we don't have invalid
>> cpusets in there.
> It’s attached. Thanks for the help, by the way.
>
Nothing wrong in that XML. I don't see what could be happening besides a
node rebooting with hyper-threading enabled for random reasons.

Please run "lstopo foo.xml" again on the node next time you get the OMPI
failure (assuming you get a chance to log in to the node before it
reboots, etc.).

Brice