Thanks, Ralph. A video would be great to accompany the slides!

I hope you have a good and productive SC16.

-- bennet


On Fri, Oct 28, 2016 at 8:40 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:
> Yes, I’ve been hearing a growing number of complaints about cgroups for that
> reason. Our mapping/ranking/binding options will work with the cgroup
> envelope, but it generally winds up with a result that isn’t what the user
> wanted or expected.
>
> We always post the OMPI BoF slides on our web site, and we’ll do the same
> this year. I may try to record a webcast on it and post that as well, since I
> know it can be confusing given all the flexibility we expose.
>
> In case you haven’t read it yet, here is the relevant section from “man
> mpirun”:
>
>    Mapping, Ranking, and Binding: Oh My!
>        Open MPI employs a three-phase procedure for assigning process
>        locations and ranks:
>
>        mapping    Assigns a default location to each process
>
>        ranking    Assigns an MPI_COMM_WORLD rank value to each process
>
>        binding    Constrains each process to run on specific processors
>
>        The mapping step is used to assign a default location to each process
>        based on the mapper being employed. Mapping by slot, node, and
>        sequentially results in the assignment of the processes to the node
>        level. In contrast, mapping by object allows the mapper to assign the
>        process to an actual object on each node.
>
>        Note: the location assigned to the process is independent of where it
>        will be bound - the assignment is used solely as input to the binding
>        algorithm.
>
>        The mapping of processes to nodes can be defined not just with general
>        policies but also, if necessary, using arbitrary mappings that cannot
>        be described by a simple policy. One can use the "sequential mapper,"
>        which reads the hostfile line by line, assigning processes to nodes in
>        whatever order the hostfile specifies. Use the -mca rmaps seq option.
>        For example, using the same hostfile as before:
>
>            mpirun -hostfile myhostfile -mca rmaps seq ./a.out
>
>        will launch three processes, one on each of nodes aa, bb, and cc,
>        respectively. The slot counts don't matter; one process is launched
>        per line on whatever node is listed on the line.
>
>        Another way to specify arbitrary mappings is with a rankfile, which
>        gives you detailed control over process binding as well. Rankfiles are
>        discussed below.
>
>        The second phase focuses on the ranking of the process within the
>        job's MPI_COMM_WORLD. Open MPI separates this from the mapping
>        procedure to allow more flexibility in the relative placement of MPI
>        processes. This is best illustrated by considering the following cases
>        where we used the --map-by ppr:2:socket option:
>
>                                     node aa       node bb
>
>            rank-by core            0 1 ! 2 3     4 5 ! 6 7
>
>            rank-by socket          0 2 ! 1 3     4 6 ! 5 7
>
>            rank-by socket:span     0 4 ! 1 5     2 6 ! 3 7
>
>        Ranking by core and by slot provide the identical result - a simple
>        progression of MPI_COMM_WORLD ranks across each node. Ranking by
>        socket does a round-robin ranking within each node until all processes
>        have been assigned an MCW rank, and then progresses to the next node.
>        Adding the span modifier to the ranking directive causes the ranking
>        algorithm to treat the entire allocation as a single entity - thus,
>        the MCW ranks are assigned across all sockets before circling back
>        around to the beginning.
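>
>        As an illustrative sketch (not part of the man page excerpt; it
>        assumes the same two-node, two-socket, four-cores-per-socket layout
>        as the table above, with ./a.out as a placeholder program), the rows
>        of that table would correspond to invocations along the lines of:
>
>            % mpirun --map-by ppr:2:socket --rank-by core --report-bindings ./a.out
>            % mpirun --map-by ppr:2:socket --rank-by socket --report-bindings ./a.out
>
>        Adding --report-bindings, as shown, prints the resulting placement of
>        each rank so the ordering can be checked against what the topology on
>        your own nodes actually provides.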
>
>        The binding phase actually binds each process to a given set of
>        processors. This can improve performance if the operating system is
>        placing processes suboptimally. For example, it might oversubscribe
>        some multi-core processor sockets, leaving other sockets idle; this
>        can lead processes to contend unnecessarily for common resources. Or,
>        it might spread processes out too widely; this can be suboptimal if
>        application performance is sensitive to interprocess communication
>        costs. Binding can also keep the operating system from migrating
>        processes excessively, regardless of how optimally those processes
>        were placed to begin with.
>
>        The processors to be used for binding can be identified in terms of
>        topological groupings - e.g., binding to an l3cache will bind each
>        process to all processors within the scope of a single L3 cache within
>        their assigned location. Thus, if a process is assigned by the mapper
>        to a certain socket, then a --bind-to l3cache directive will cause the
>        process to be bound to the processors that share a single L3 cache
>        within that socket.
>
>        To help balance loads, the binding directive uses a round-robin method
>        when binding to levels lower than used in the mapper. For example,
>        consider the case where a job is mapped to the socket level, and then
>        bound to core. Each socket will have multiple cores, so if multiple
>        processes are mapped to a given socket, the binding algorithm will
>        assign each process located on a socket to a unique core in a
>        round-robin manner.
>
>        Alternatively, processes mapped by l2cache and then bound to socket
>        will simply be bound to all the processors in the socket where they
>        are located. In this manner, users can exert detailed control over
>        relative MCW rank location and binding.
>
>        Finally, --report-bindings can be used to report bindings.
>
>        As an example, consider a node with two processor sockets, each
>        comprising four cores. We run mpirun with -np 4 --report-bindings and
>        the following additional options:
>
>            % mpirun ... --map-by core --bind-to core
>            [...] ... binding child [...,0] to cpus 0001
>            [...] ... binding child [...,1] to cpus 0002
>            [...] ... binding child [...,2] to cpus 0004
>            [...] ... binding child [...,3] to cpus 0008
>
>            % mpirun ... --map-by socket --bind-to socket
>            [...] ... binding child [...,0] to socket 0 cpus 000f
>            [...] ... binding child [...,1] to socket 1 cpus 00f0
>            [...] ... binding child [...,2] to socket 0 cpus 000f
>            [...] ... binding child [...,3] to socket 1 cpus 00f0
>
>            % mpirun ... --map-by core:PE=2 --bind-to core
>            [...] ... binding child [...,0] to cpus 0003
>            [...] ... binding child [...,1] to cpus 000c
>            [...] ... binding child [...,2] to cpus 0030
>            [...] ... binding child [...,3] to cpus 00c0
>
>            % mpirun ... --bind-to none
>
>        Here, --report-bindings shows the binding of each process as a mask.
>        In the first case, the processes bind to successive cores as indicated
>        by the masks 0001, 0002, 0004, and 0008. In the second case, processes
>        bind to all cores on successive sockets as indicated by the masks 000f
>        and 00f0. The processes cycle through the processor sockets in a
>        round-robin fashion as many times as are needed. In the third case,
>        the masks show us that 2 cores have been bound per process. In the
>        fourth case, binding is turned off and no bindings are reported.
>
>        Open MPI's support for process binding depends on the underlying
>        operating system. Therefore, certain process binding options may not
>        be available on every system.
>
>        Process binding can also be set with MCA parameters. Their usage is
>        less convenient than that of mpirun options. On the other hand, MCA
>        parameters can be set not only on the mpirun command line, but
>        alternatively in a system or user mca-params.conf file or as
>        environment variables, as described in the MCA section below. Some
>        examples include:
>
>            mpirun option          MCA parameter key             value
>
>            --map-by core          rmaps_base_mapping_policy     core
>            --map-by socket        rmaps_base_mapping_policy     socket
>            --rank-by core         rmaps_base_ranking_policy     core
>            --bind-to core         hwloc_base_binding_policy     core
>            --bind-to socket       hwloc_base_binding_policy     socket
>            --bind-to none         hwloc_base_binding_policy     none
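>
>        For illustration only (not part of the man page text; the parameter
>        names and values simply mirror the table above, and ./a.out and the
>        process count are placeholders), the same binding policy could be
>        requested in any of these ways:
>
>            # on the mpirun command line
>            % mpirun --mca hwloc_base_binding_policy core -np 4 ./a.out
>
>            # as an environment variable, using Open MPI's OMPI_MCA_ prefix
>            % export OMPI_MCA_hwloc_base_binding_policy=core
>            % mpirun -np 4 ./a.out
>
>            # in a user-level parameter file (typically $HOME/.openmpi/mca-params.conf)
>            hwloc_base_binding_policy = core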
>
>> On Oct 28, 2016, at 4:50 PM, Bennet Fauber <ben...@umich.edu> wrote:
>>
>> Ralph,
>>
>> Alas, I will not be at SC16. I would like to hear and/or see what you
>> present, so if it gets made available in an alternate format, I'd
>> appreciate knowing where and how to get it.
>>
>> I am more and more coming to think that our cluster configuration is
>> essentially designed to frustrate MPI developers, because we use the
>> scheduler to create cgroups (once upon a time, cpusets) for subsets of
>> cores on multisocket machines, and I think that invalidates a lot of
>> the assumptions that are getting made by people who want to bind to
>> particular patterns.
>>
>> It's our foot, and we have been doing a good job of shooting it. ;-)
>>
>> -- bennet
>>
>>
>> On Fri, Oct 28, 2016 at 7:18 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:
>>> FWIW: I’ll be presenting “Mapping, Ranking, and Binding - Oh My!” at the
>>> OMPI BoF meeting at SC’16, for those who can attend. Will try to explain
>>> the rationale as well as the mechanics of the options.
>>>
>>> On Oct 11, 2016, at 8:09 AM, Dave Love <d.l...@liverpool.ac.uk> wrote:
>>>
>>> Gilles Gouaillardet <gil...@rist.or.jp> writes:
>>>
>>> Bennet,
>>>
>>> my guess is mapping/binding to sockets was deemed the best compromise
>>> from an "out of the box" performance point of view.
>>>
>>> iirc, we did fix some bugs that occurred when running under asymmetric
>>> cpusets/cgroups.
>>>
>>> if you still have some issues with the latest Open MPI version (2.0.1)
>>> and the default policy, could you please describe them?
>>>
>>> I also don't understand why binding to sockets is the right thing to do.
>>> Binding to cores seems the right default to me, and I set that locally,
>>> with instructions about running OpenMP. (Isn't that what other
>>> implementations do, which makes them look better?)
>>>
>>> I think at least numa should be used, rather than socket. Knights
>>> Landing, for instance, is single-socket, so gets no actual binding by
>>> default.

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users