> On Apr 30, 2015, at 1:16 PM, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:
>
>> On Apr 30, 2015, at 12:03 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> The planning is pretty simple: at startup, mpirun launches a daemon on each
>> node. If --hetero-nodes is provided, each daemon returns the topology
>> discovered by hwloc - otherwise, only the first daemon does. Mpirun then
>> assigns procs to each node in a round-robin fashion (assuming you haven't
>> told it something different).
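For anyone who finds this thread later: both of the options that came up here are ordinary mpirun flags, so a minimal invocation along those lines would look roughly like the line below (the process count and executable name are just placeholders, not what I actually ran):

    mpirun --hetero-nodes --report-bindings -np 32 ./my_program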
Now that I’ve solved my problem, I thought I’d summarize it on the list as a cautionary tale. I’d also like to thank everyone who helped me.

Basically, the information from Ralph above should have clued me in, but didn’t. It turns out that the nodes were only _supposed_ to be identical, and since I assumed they actually were, I thought the --hetero-nodes option was irrelevant. As it happens, hyperthreading had been turned on on one of the nodes. When that node was the head node of the job, the binding mask assumed 32 (HT) cores, and when that mask was applied to a node with only 16 (real) cores and no HT, the binding failed. Turning off HT on that one rogue node fixed the problem.

Things that made this hard to debug:

1. The node that was problematic was not the one that failed. The node next to it (since our scheduler by default assigns adjacent nodes) was the one that reported the failure. This is just the nature of the problem.

2. Open MPI’s binding report appears to be printed only after binding completes, or perhaps the output simply isn’t flushed on failure, because I never got any output (with hwloc_report_binding) from the node that was actually failing to bind. And I didn’t know the format of the reported bindings, so I didn’t realize that “BB” meant both HT virtual cores were bound, and that the core numbers (on the HT node with 32 virtual cores) ran from 0-15, not 0-31.

Anyway, problem solved, and thanks again for the help.

Noam
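P.S. In case it saves someone else the same confusion: the binding report lines look roughly like the one below. This is an illustrative line, not copied from my run; the hostname and PID are placeholders, and the exact layout may differ between Open MPI versions.

    [node02:12345] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]

Each slot between the slashes is a core, and a “B” marks a bound hardware thread, so “BB” means both hyperthreads of that core are in the binding mask; on the 32-thread HT node the core numbers still only run 0-15.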