The planning is pretty simple: at startup, mpirun launches a daemon on each node. If --hetero-nodes is provided, each daemon returns the topology discovered by hwloc; otherwise, only the first daemon does, and its topology is assumed to apply to every node. mpirun then assigns procs to each node in a round-robin fashion (assuming you haven't told it to do something different).

Once that is done, mpirun looks at each proc that has been assigned to the node, finds the next available core on that node, and computes the cpuset that would bind the proc to it. We then pass the cpuset back to the daemon on that node; when the daemon spawns the local child process, it takes the cpuset and asks hwloc to bind the proc to it.
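To make that concrete, here is a rough sketch of those two steps using raw hwloc calls. This is not Open MPI's actual code - the local_rank argument is just a hypothetical stand-in for the proc index mpirun assigned - but the hwloc calls (topology load, core lookup, cpubind) are the ones involved:

/*
 * Rough sketch (not Open MPI's actual code) of the two steps above.
 * "local_rank" is a hypothetical stand-in for the proc index that
 * mpirun assigned to this node.
 */
#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int local_rank = (argc > 1) ? atoi(argv[1]) : 0;
    hwloc_topology_t topo;
    char buf[128];

    /* Step 1: discover the node topology, as each daemon does. */
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);

    /* Step 2: pick the next core round-robin, compute its cpuset,
     * and bind - what the daemon does when spawning the child. */
    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE,
                                             local_rank % ncores);
    hwloc_bitmap_snprintf(buf, sizeof(buf), core->cpuset);
    printf("local rank %d -> core %u (cpuset %s)\n",
           local_rank, core->logical_index, buf);

    if (hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_PROCESS) < 0)
        perror("hwloc_set_cpubind");

    hwloc_topology_destroy(topo);
    return 0;
}

Compile with something like "gcc -o bind_sketch bind_sketch.c -lhwloc"; running it with different local ranks prints the cpuset each proc would be bound to on that node.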
> On Apr 30, 2015, at 5:23 AM, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:
>
>> On Apr 29, 2015, at 5:59 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Try adding --hetero-nodes to the cmd line and see if that helps resolve the problem. Of course, if all the machines are identical, then it won't
>
> They are identical, and the problem is new. That's what's most mysterious about it.
>
> Can anyone give me an explanation, or point me to documentation, of the process by which the binding is planned and executed? By the way, these jobs are all running with OMPI_MCA_hwloc_base_binding_policy=core.
>
> Noam
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26819.php