This probably isn't very helpful, but fwiw: we added an automatic "fingerprint" capability in later OMPI versions just to detect things like this. If the fingerprint of a backend node doesn't match the head node's, we automatically assume hetero-nodes. It isn't foolproof, but it would have picked this one up.
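To make that concrete, here is roughly what such a fingerprint boils down to. This is not the actual OMPI code, just an illustrative sketch using the public hwloc and MPI APIs: each process builds a coarse signature of its node (physical core count and PU count, so a hyperthreaded node immediately looks different) and compares it against what rank 0 reports.

/* Illustrative sketch only, NOT the OMPI implementation: each rank builds a
 * coarse topology signature with hwloc and compares it with rank 0's.
 * A PU count of 2x the core count is the usual sign that HT is enabled. */
#include <mpi.h>
#include <hwloc.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Discover the local node's topology with hwloc. */
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    int sig[2];
    sig[0] = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);  /* physical cores */
    sig[1] = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);    /* hardware threads */
    hwloc_topology_destroy(topo);

    /* Broadcast the "head" (rank 0) signature and compare locally. */
    int head[2] = { sig[0], sig[1] };
    MPI_Bcast(head, 2, MPI_INT, 0, MPI_COMM_WORLD);

    if (sig[0] != head[0] || sig[1] != head[1]) {
        char host[256];
        gethostname(host, sizeof(host));
        fprintf(stderr,
                "rank %d on %s: %d cores / %d PUs, head node has %d / %d; "
                "treat this job as hetero-nodes\n",
                rank, host, sig[0], sig[1], head[0], head[1]);
    }

    MPI_Finalize();
    return 0;
}

In OMPI itself the comparison happens between the daemons and mpirun at startup rather than between application ranks, but the idea is the same: any mismatch in the reported topology means the nodes cannot be treated as identical.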
Sorry you had that trouble.
Ralph

> On Jun 1, 2015, at 1:53 PM, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:
>
>> On Apr 30, 2015, at 1:16 PM, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:
>>
>>> On Apr 30, 2015, at 12:03 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> The planning is pretty simple: at startup, mpirun launches a daemon on each node. If --hetero-nodes is provided, each daemon returns the topology discovered by hwloc; otherwise, only the first daemon does. Mpirun then assigns procs to each node in a round-robin fashion (assuming you haven't told it something different).
>
> Now that I've solved my problem, I thought I'd summarize it on the list as a cautionary tale. I'd also like to thank everyone who helped me.
>
> Basically, the information above from Ralph should have clued me in, but it didn't. The nodes were only _supposed_ to be identical, so I assumed the --hetero-nodes option was irrelevant. As it happens, hyperthreading had been turned on on one of the nodes. When that node was the head node, the binding mask assumed 32 (HT) cores, and when the mapping reached a node with only 16 (real) cores and no HT, it failed. Turning off HT on that one rogue node fixed the problem.
>
> Things that made this hard to debug:
> 1. The misconfigured node was not the one that reported the failure. The node next to it (our scheduler assigns adjacent nodes by default) is the one that claimed to fail. That is just the nature of the problem.
> 2. Open MPI's binding report appears to be printed only after binding is complete (or perhaps the output simply never got flushed), so I never saw any output (with hwloc_report_binding) from the node that was actually failing to bind. I also didn't know the format of the reported binding, so I didn't realize that "BB" meant both HT virtual cores were bound, or that the numbers (on the HT node with 32 virtual cores) ran 0-15, not 0-31.
>
> Anyway, problem solved, and thanks again for the help.
>
> Noam
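P.S. For anyone who hits something similar: mpirun's --report-bindings output is the first thing to check, but a tiny per-rank self-check along these lines can also make a bad binding obvious right away. This is only a rough sketch against the public hwloc and MPI APIs (nothing here ships with OMPI); each rank simply prints the cpuset the OS says it is bound to, plus its node's core and PU counts.

/* Sketch only, not part of Open MPI: every rank reports its actual binding.
 * On the cluster described above, the rank that inherited a 32-PU (HT)
 * binding map on a 16-core node would have stood out immediately. */
#include <mpi.h>
#include <hwloc.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Ask the OS which PUs this process is currently bound to. */
    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS);

    char *cpus = NULL;
    hwloc_bitmap_list_asprintf(&cpus, set);   /* e.g. "0-15" or "4,20" */

    char host[256];
    gethostname(host, sizeof(host));
    printf("rank %d on %s: bound to PUs %s (node has %d cores, %d PUs)\n",
           rank, host, cpus,
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE),
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU));

    free(cpus);
    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    MPI_Finalize();
    return 0;
}

Compile it with mpicc, link against hwloc (-lhwloc), and launch it exactly the way the real job is launched; a node reporting twice as many PUs as its neighbors, or a binding list that doesn't fit the node, points straight at the odd machine.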