> There is a MATPARTITIONINGHIERARCH (man page) that Fande provided that
> helped scaling up problems he was working on significantly.
>
> Barry

The scaling issue with DMPlex is the one-to-all communication pattern that
occurs when distributing an initially sequential mesh.
MATPARTITIONINGHIERARCH won't fix that. To get reasonable performance when
distributing a sequential mesh over a large number of processes, you need
at least two stages of partitioning: first partition the sequential mesh so
that one process per node owns a piece and migrate the Plex data, then
partition on each node separately and migrate the data again.

> On Oct 3, 2020, at 10:04 AM, Matthew Knepley <[email protected]> wrote:
>
> On Sat, Oct 3, 2020 at 10:51 AM Stefano Zampini <[email protected]>
> wrote:
>
>>> Secondly, I'd like to add a multilevel "simple" partitioning in DMPlex
>>> to optimize communication. I am thinking that I can create a mesh with
>>> 'nnodes' cells and distribute that to 'nnodes*procs_node' processes with a
>>> "spread" distribution. (the default seems to be "compact"). Then refine
>>> that enough to get 'procs_node' more cells and then use a simple partitioner
>>> again to put one cell on each process, in such a way that the locality is
>>> preserved (not sure how that would work). Then refine from there on each
>>> proc for a scaling study.
>>>
>> Mark
>>
>> for multilevel partitioning, you need custom code, since what kills
>> performance with one-to-all patterns in DMPlex is the actual communication
>> of the mesh data.
>> However, you can always generate a mesh to have one cell per process, and
>> then refine from there.
>>
>> I have coded a multilevel partitioner that works quite well for
>> general meshes, we have it in a private repo with Lisandro. From my
>> experience, the benefits of using the multilevel scheme start from 4K
>> processes on. If you plan very large runs (say > 32K cores) then you
>> definitely want a multistage scheme.
>>
>> We never contributed the code since it requires some boilerplate code to
>> run through the stages of the partitioning and move the data.
>> If you are using hexas, you can always define your own "shell"
>> partitioner producing box decompositions.
>>
>
> I could integrate it if you want to stop maintaining it there :) It sounds
> really useful.
>
> Thanks,
>
> Matt
>
>> Another option is to generate the meshes upfront in sequential, and then
>> use the parallel HDF5 reader that Vaclav and Matt put together.
>>
>>> The point here is to get communication patterns that look like an
>>> (idealized) well partition application. (I suppose I could take an array of
>>> factors, the product of which is the number of processors, and generalize
>>> this in a loop for any number of memory levels, or make an oct-tree).
>>>
>>> Any thoughts?
>>> Thanks,
>>> Mark
>>>
>>
>> --
>> Stefano
>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/

--
Stefano
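
For illustration, the two-stage distribution described in the reply above
can be mimicked with a toy model in plain Python. This is only a sketch and
not PETSc code: the function names (block_partition, one_stage, two_stage)
are made up, cells are just integers, the "partitioner" is a contiguous
block split, and ranks are assumed to be numbered compactly by node (rank
node*procs_per_node is the node leader). It only counts how many messages
each sender issues and says nothing about actual timings or data volume.

```python
from collections import Counter


def block_partition(items, nparts):
    """Split items into nparts contiguous, nearly equal blocks."""
    base, extra = divmod(len(items), nparts)
    blocks, start = [], 0
    for p in range(nparts):
        size = base + (1 if p < extra else 0)
        blocks.append(items[start:start + size])
        start += size
    return blocks


def one_stage(cells, nnodes, procs_per_node):
    """Rank 0 partitions for all ranks and sends each block directly."""
    nranks = nnodes * procs_per_node
    owner, sends = {}, Counter()
    for rank, block in enumerate(block_partition(cells, nranks)):
        for c in block:
            owner[c] = rank
        if rank != 0 and block:
            sends[0] += 1  # one message from the root per remote rank
    return owner, sends


def two_stage(cells, nnodes, procs_per_node):
    """Stage 1: root -> one leader per node. Stage 2: leader -> its node."""
    owner, sends = {}, Counter()
    for node, node_cells in enumerate(block_partition(cells, nnodes)):
        # Assumption: compact rank numbering, first rank on a node leads.
        leader = node * procs_per_node
        if leader != 0 and node_cells:
            sends[0] += 1  # stage 1 message from the root
        # Stage 2: each leader repartitions only what it received.
        local_blocks = block_partition(node_cells, procs_per_node)
        for local, block in enumerate(local_blocks):
            rank = leader + local
            for c in block:
                owner[c] = rank
            if rank != leader and block:
                sends[leader] += 1  # stage 2 message inside the node
    return owner, sends
```

With 64 cells on 4 nodes of 4 processes each, rank 0 issues 15 sends in the
one-stage model and 6 in the two-stage one (3 to the other node leaders, 3
inside its own node), and the remaining sends are spread over the other
leaders.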
