Thanks, Ralph. A video would be great to accompany the slides!

I hope you have a good and productive SC16.

-- bennet


On Fri, Oct 28, 2016 at 8:40 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:
> Yes, I’ve been hearing a growing number of complaints about cgroups for that
> reason. Our mapping/ranking/binding options will work with the cgroup
> envelope, but it generally winds up with a result that isn’t what the user
> wanted or expected.
>
> We always post the OMPI BoF slides on our web site, and we’ll do the same
> this year. I may try to record a webcast on it and post that as well, since I
> know it can be confusing given all the flexibility we expose.
>
> In case you haven’t read it yet, here is the relevant section from “man
> mpirun”:
>
>    Mapping, Ranking, and Binding: Oh My!
>        Open MPI employs a three-phase procedure for assigning process
>        locations and ranks:
>
>        mapping    Assigns a default location to each process
>
>        ranking    Assigns an MPI_COMM_WORLD rank value to each process
>
>        binding    Constrains each process to run on specific processors
>
>        The mapping step is used to assign a default location to each process
>        based on the mapper being employed. Mapping by slot, node, and
>        sequentially results in the assignment of the processes to the node
>        level. In contrast, mapping by object allows the mapper to assign the
>        process to an actual object on each node.
>
>        Note: the location assigned to the process is independent of where it
>        will be bound - the assignment is used solely as input to the binding
>        algorithm.
>
>        The mapping of processes to nodes can be defined not just with general
>        policies but also, if necessary, using arbitrary mappings that cannot
>        be described by a simple policy. One can use the "sequential mapper,"
>        which reads the hostfile line by line, assigning processes to nodes in
>        whatever order the hostfile specifies. Use the -mca rmaps seq option.
>        For example, using the same hostfile as before:
>
>            mpirun -hostfile myhostfile -mca rmaps seq ./a.out
>
>        will launch three processes, one on each of nodes aa, bb, and cc,
>        respectively. The slot counts don't matter; one process is launched
>        per line on whatever node is listed on the line.
>
>        Another way to specify arbitrary mappings is with a rankfile, which
>        gives you detailed control over process binding as well. Rankfiles are
>        discussed below.
>
>        The second phase focuses on the ranking of the process within the
>        job's MPI_COMM_WORLD. Open MPI separates this from the mapping
>        procedure to allow more flexibility in the relative placement of MPI
>        processes. This is best illustrated by considering the following cases
>        where we used the --map-by ppr:2:socket option:
>
>                                     node aa       node bb
>
>            rank-by core            0 1 ! 2 3     4 5 ! 6 7
>
>            rank-by socket          0 2 ! 1 3     4 6 ! 5 7
>
>            rank-by socket:span     0 4 ! 1 5     2 6 ! 3 7
>
>        Ranking by core and by slot provide the identical result - a simple
>        progression of MPI_COMM_WORLD ranks across each node. Ranking by
>        socket does a round-robin ranking within each node until all processes
>        have been assigned an MCW rank, and then progresses to the next node.
>        Adding the span modifier to the ranking directive causes the ranking
>        algorithm to treat the entire allocation as a single entity - thus,
>        the MCW ranks are assigned across all sockets before circling back
>        around to the beginning.
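>
>        As an illustrative sketch (not part of the man page excerpt; it
>        assumes the same two-node, two-socket, four-cores-per-socket layout
>        as the table above, with ./a.out as a placeholder program), the rows
>        of that table would correspond to invocations along the lines of:
>
>            % mpirun --map-by ppr:2:socket --rank-by core --report-bindings ./a.out
>            % mpirun --map-by ppr:2:socket --rank-by socket --report-bindings ./a.out
>
>        Adding --report-bindings, as shown, prints the resulting placement of
>        each rank so the ordering can be checked against what the topology on
>        your own nodes actually provides.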
>
>        The binding phase actually binds each process to a given set of
>        processors. This can improve performance if the operating system is
>        placing processes suboptimally. For example, it might oversubscribe
>        some multi-core processor sockets, leaving other sockets idle; this
>        can lead processes to contend unnecessarily for common resources. Or,
>        it might spread processes out too widely; this can be suboptimal if
>        application performance is sensitive to interprocess communication
>        costs. Binding can also keep the operating system from migrating
>        processes excessively, regardless of how optimally those processes
>        were placed to begin with.
>
>        The processors to be used for binding can be identified in terms of
>        topological groupings - e.g., binding to an l3cache will bind each
>        process to all processors within the scope of a single L3 cache within
>        their assigned location. Thus, if a process is assigned by the mapper
>        to a certain socket, then a --bind-to l3cache directive will cause the
>        process to be bound to the processors that share a single L3 cache
>        within that socket.
>
>        To help balance loads, the binding directive uses a round-robin method
>        when binding to levels lower than used in the mapper. For example,
>        consider the case where a job is mapped to the socket level, and then
>        bound to core. Each socket will have multiple cores, so if multiple
>        processes are mapped to a given socket, the binding algorithm will
>        assign each process located on a socket to a unique core in a
>        round-robin manner.
>
>        Alternatively, processes mapped by l2cache and then bound to socket
>        will simply be bound to all the processors in the socket where they
>        are located. In this manner, users can exert detailed control over
>        relative MCW rank location and binding.
>
>        Finally, --report-bindings can be used to report bindings.
>
>        As an example, consider a node with two processor sockets, each
>        comprising four cores. We run mpirun with -np 4 --report-bindings and
>        the following additional options:
>
>            % mpirun ... --map-by core --bind-to core
>            [...] ... binding child [...,0] to cpus 0001
>            [...] ... binding child [...,1] to cpus 0002
>            [...] ... binding child [...,2] to cpus 0004
>            [...] ... binding child [...,3] to cpus 0008
>
>            % mpirun ... --map-by socket --bind-to socket
>            [...] ... binding child [...,0] to socket 0 cpus 000f
>            [...] ... binding child [...,1] to socket 1 cpus 00f0
>            [...] ... binding child [...,2] to socket 0 cpus 000f
>            [...] ... binding child [...,3] to socket 1 cpus 00f0
>
>            % mpirun ... --map-by core:PE=2 --bind-to core
>            [...] ... binding child [...,0] to cpus 0003
>            [...] ... binding child [...,1] to cpus 000c
>            [...] ... binding child [...,2] to cpus 0030
>            [...] ... binding child [...,3] to cpus 00c0
>
>            % mpirun ... --bind-to none
>
>        Here, --report-bindings shows the binding of each process as a mask.
>        In the first case, the processes bind to successive cores as indicated
>        by the masks 0001, 0002, 0004, and 0008. In the second case, processes
>        bind to all cores on successive sockets as indicated by the masks 000f
>        and 00f0. The processes cycle through the processor sockets in a
>        round-robin fashion as many times as are needed. In the third case,
>        the masks show us that 2 cores have been bound per process. In the
>        fourth case, binding is turned off and no bindings are reported.
>
>        Open MPI's support for process binding depends on the underlying
>        operating system. Therefore, certain process binding options may not
>        be available on every system.
>
>        Process binding can also be set with MCA parameters. Their usage is
>        less convenient than that of mpirun options. On the other hand, MCA
>        parameters can be set not only on the mpirun command line, but
>        alternatively in a system or user mca-params.conf file or as
>        environment variables, as described in the MCA section below. Some
>        examples include:
>
>            mpirun option          MCA parameter key             value
>
>            --map-by core          rmaps_base_mapping_policy     core
>            --map-by socket        rmaps_base_mapping_policy     socket
>            --rank-by core         rmaps_base_ranking_policy     core
>            --bind-to core         hwloc_base_binding_policy     core
>            --bind-to socket       hwloc_base_binding_policy     socket
>            --bind-to none         hwloc_base_binding_policy     none
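>
>        For illustration only (not part of the man page text; the parameter
>        names and values simply mirror the table above, and ./a.out and the
>        process count are placeholders), the same binding policy could be
>        requested in any of these ways:
>
>            # on the mpirun command line
>            % mpirun --mca hwloc_base_binding_policy core -np 4 ./a.out
>
>            # as an environment variable, using Open MPI's OMPI_MCA_ prefix
>            % export OMPI_MCA_hwloc_base_binding_policy=core
>            % mpirun -np 4 ./a.out
>
>            # in a user-level parameter file (typically $HOME/.openmpi/mca-params.conf)
>            hwloc_base_binding_policy = core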
>
>> On Oct 28, 2016, at 4:50 PM, Bennet Fauber <ben...@umich.edu> wrote:
>>
>> Ralph,
>>
>> Alas, I will not be at SC16. I would like to hear and/or see what you
>> present, so if it gets made available in an alternate format, I'd
>> appreciate knowing where and how to get it.
>>
>> I am more and more coming to think that our cluster configuration is
>> essentially designed to frustrate MPI developers, because we use the
>> scheduler to create cgroups (once upon a time, cpusets) for subsets of
>> cores on multisocket machines, and I think that invalidates a lot of
>> the assumptions that are getting made by people who want to bind to
>> particular patterns.
>>
>> It's our foot, and we have been doing a good job of shooting it. ;-)
>>
>> -- bennet
>>
>>
>> On Fri, Oct 28, 2016 at 7:18 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:
>>> FWIW: I’ll be presenting “Mapping, Ranking, and Binding - Oh My!” at the
>>> OMPI BoF meeting at SC’16, for those who can attend. Will try to explain
>>> the rationale as well as the mechanics of the options.
>>>
>>> On Oct 11, 2016, at 8:09 AM, Dave Love <d.l...@liverpool.ac.uk> wrote:
>>>
>>> Gilles Gouaillardet <gil...@rist.or.jp> writes:
>>>
>>> Bennet,
>>>
>>> my guess is mapping/binding to sockets was deemed the best compromise
>>> from an "out of the box" performance point of view.
>>>
>>> iirc, we did fix some bugs that occurred when running under asymmetric
>>> cpusets/cgroups.
>>>
>>> if you still have some issues with the latest Open MPI version (2.0.1)
>>> and the default policy, could you please describe them?
>>>
>>> I also don't understand why binding to sockets is the right thing to do.
>>> Binding to cores seems the right default to me, and I set that locally,
>>> with instructions about running OpenMP. (Isn't that what other
>>> implementations do, which makes them look better?)
>>>
>>> I think at least numa should be used, rather than socket. Knights
>>> Landing, for instance, is single-socket, so gets no actual binding by
>>> default.

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users