The planning is pretty simple: at startup, mpirun launches a daemon on each 
node. If --hetero-nodes is provided, each daemon returns the topology discovered 
by hwloc; otherwise, only the first daemon does. mpirun then assigns procs to 
each node in a round-robin fashion (assuming you haven’t told it to do something 
different).
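
To make the mapping step concrete, here is a toy C sketch of a round-robin
assignment (illustration only; this is not the actual ORTE mapper code, and
num_procs/num_nodes are made-up values):

#include <stdio.h>

int main(void)
{
    const int num_procs = 8;   /* hypothetical job size */
    const int num_nodes = 3;   /* hypothetical allocation */

    /* Round-robin: rank r lands on node r mod num_nodes */
    for (int rank = 0; rank < num_procs; rank++)
        printf("rank %d -> node %d\n", rank, rank % num_nodes);

    return 0;
}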

Once that is done, mpirun looks at each proc that has been assigned to the 
node, finds the next available core on that node, and computes the cpuset that 
would bind the proc to it. We then pass that cpuset back to the daemon on that 
node.
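
In hwloc terms, that computation looks roughly like the following (a sketch
against the public hwloc API, not the actual mpirun internals; the "next
available" core index is hardwired to 0 here):

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Pretend core 0 is the next available core on this node */
    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0);
    if (core) {
        char buf[128];
        hwloc_bitmap_snprintf(buf, sizeof(buf), core->cpuset);
        printf("cpuset for core 0: %s\n", buf);
    }

    hwloc_topology_destroy(topo);
    return 0;
}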

When the daemon spawns the local child process, it takes the cpuset and asks 
hwloc to bind the proc to that cpuset.
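
The binding call itself is the standard hwloc one; a minimal sketch (again
hardwiring core 0's cpuset in place of the one mpirun computed):

#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0);
    if (core) {
        /* Bind the whole process (all of its threads) to that core's cpuset */
        hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_PROCESS);
    }

    hwloc_topology_destroy(topo);
    return 0;
}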


> On Apr 30, 2015, at 5:23 AM, Noam Bernstein <noam.bernst...@nrl.navy.mil> 
> wrote:
> 
>> On Apr 29, 2015, at 5:59 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>> Try adding --hetero-nodes to the cmd line and see if that helps resolve the 
>> problem. Of course, if all the machines are identical, then it won’t make a 
>> difference.
> 
> They are identical, and the problem is new.  That’s what’s most mysterious 
> about it.  
> 
> Can anyone give me an explanation, or point me to documentation, of the 
> process by which the binding is planned and executed?  By the way, these jobs 
> are all running with OMPI_MCA_hwloc_base_binding_policy=core.
> 
>       Noam
