> On Oct 3, 2020, at 9:12 PM, Matthew Knepley <[email protected]> wrote:
>
> On Sat, Oct 3, 2020 at 1:49 PM Stefano Zampini <[email protected]> wrote:
>
> There is a MATPARTITIONINGHIERARCH (man page) that Fande provided that
> helped scaling up problems he was working on significantly.
>
> Barry
>
> The scaling issue with DMPlex is the one-to-all pattern of communication that
> happens when distributing an original sequential mesh.
> MATPARTITIONINGHIERARCH won't fix the issue.
> In order to get reasonable performance when distributing a sequential mesh
> on a large number of processes, you need at least two stages of partitioning:
> an initial one from the sequential mesh to a mesh with one process per node,
> migrate the PLEX data, then partition on each node separately, and migrate
> the data again.
>
> Just to make sure I understand completely: you partition a serial mesh (SELF)
> onto one process per node (1PROC), then refine, and repartition the new
> mesh onto the whole machine (WORLD). Thus I need three communicators, right?
> And also a method for moving a Plex on a subcomm onto the larger comm,
> using 0 parts on the new ranks.

This is what the loop over stages looks like in pseudocode:

DMPlexDistributeML(DM, overlap, SF *migration, DM *newdm)
  for i = 0 : nstages
    DMPlexGetPartitioner(dm, &p)
    PartitionerSetStage(p, i)
    DMPlexDistribute(dm, 0, &sft, &dmt)
    SFCompose(sf, sft)
    dm = dmt
    if (i == nstages and overlap > 1) DMPlexDistributeOverlap
  end

I thought about including the refinement step within the stages, but it turns
out this is not doable right now if you want to get back a usable migration SF
(and we need it to migrate our data). For a general solution, we need hooks for
migrating user-defined data and for generating user-defined data while refining.

The partitioner is defined with a series of MPI_Groups that identify the
processes involved in each stage. But the temporary meshes that are generated
in the loop are always defined on the same global communicator.

>
> Thanks,
>
> Matt
>
>
>> On Oct 3, 2020, at 10:04 AM, Matthew Knepley <[email protected]> wrote:
>>
>> On Sat, Oct 3, 2020 at 10:51 AM Stefano Zampini <[email protected]> wrote:
>>
>> Secondly, I'd like to add a multilevel "simple" partitioning in DMPlex to
>> optimize communication. I am thinking that I can create a mesh with 'nnodes'
>> cells and distribute that to 'nnodes*procs_node' processes with a "spread"
>> distribution (the default seems to be "compact"). Then refine that enough
>> to get 'procs_node' more cells and then use a simple partitioner again to put
>> one cell on each process, in such a way that the locality is preserved (not
>> sure how that would work). Then refine from there on each proc for a scaling
>> study.
>>
>> Mark
>>
>> For multilevel partitioning, you need custom code, since what kills
>> performance with one-to-all patterns in DMPlex is the actual communication
>> of the mesh data.
>> However, you can always generate a mesh to have one cell per process, and
>> then refine from there.
>>
>> I have coded a multilevel partitioner that works quite well for general
>> meshes; we have it in a private repo with Lisandro. From my experience, the
>> benefits of using the multilevel scheme start from 4K processes on. If you
>> plan very large runs (say > 32K cores) then you definitely want a multistage
>> scheme.
>>
>> We never contributed the code since it requires some boilerplate code to run
>> through the stages of the partitioning and move the data.
>> If you are using hexas, you can always define your own "shell" partitioner
>> producing box decompositions.
>>
>> I could integrate it if you want to stop maintaining it there :) It sounds
>> really useful.
>>
>> Thanks,
>>
>> Matt
>>
>> Another option is to generate the meshes upfront in sequential, and then use
>> the parallel HDF5 reader that Vaclav and Matt put together.
>>
>> The point here is to get communication patterns that look like an
>> (idealized) well-partitioned application. (I suppose I could take an array of
>> factors, the product of which is the number of processors, and generalize
>> this in a loop for any number of memory levels, or make an oct-tree).
>>
>> Any thoughts?
>> Thanks,
>> Mark
>>
>> --
>> Stefano
>>
>> --
>> What most experimenters take for granted before they begin their experiments
>> is infinitely more interesting than any results to which their experiments
>> lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
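
To make the staged pattern discussed in this thread concrete, here is a minimal Python sketch of who would send mesh data to whom at each stage, following Mark's "array of factors" idea. It is not PETSc code and not the private-repo partitioner: the function name migration_plan is hypothetical, and it assumes ranks are numbered compactly within each node (ranks 0..procs_node-1 on node 0, and so on), so that the first rank of each block acts as the block leader.

```python
def migration_plan(factors):
    """Return, for each stage, a dict {sender_rank: [receiver_ranks]}.

    factors[k] is the fan-out of stage k (e.g. [nnodes, procs_node]);
    the product of all factors is the total number of ranks.
    Assumption: compact rank numbering, block leaders are the lowest ranks.
    """
    nranks = 1
    for f in factors:
        nranks *= f

    plan = []
    senders = [0]          # stage 0 starts from the serial mesh on rank 0
    block = nranks         # number of ranks "owned" by each current sender
    for f in factors:
        block //= f        # each sender splits its block into f sub-blocks
        stage = {src: [src + j * block for j in range(f)] for src in senders}
        plan.append(stage)
        # the receivers of this stage are the senders of the next one
        senders = sorted(r for dsts in stage.values() for r in dsts)
    return plan


# Two nodes with four processes each: serial -> one rank per node -> all ranks.
plan = migration_plan([2, 4])
# plan == [{0: [0, 4]}, {0: [0, 1, 2, 3], 4: [4, 5, 6, 7]}]

# Largest number of receivers any single rank talks to in any stage.
max_fanout = max(len(dsts) for stage in plan for dsts in stage.values())
```

With factors [2, 4] no rank sends to more than 4 receivers in a stage, whereas a single-stage distribution to 8 ranks would have rank 0 sending to all 8. The per-stage receiver lists play the role that the MPI_Groups play in the pseudocode above; building actual groups or communicators from them is left out of this sketch.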
