On Fri, 12 Mar 2010 15:06:33 -0500, Gus Correa <g...@ldeo.columbia.edu> wrote:
> Hi Cole, Jed
> 
> I don't have much direct experience with PETSc.

Disclaimer: I've been using PETSc for several years and also work on the
library itself.

> I mostly troubleshot other people's PETSc programs,
> and observed their performance.
> What I noticed is:
> 1) PETSc's learning curve is as steep as, if not steeper than, MPI's, and

I think this depends strongly on what you want to do.  Since the library
is built on top of MPI, the claim is trivially true in one sense: it is
beneficial for the user to be familiar with collective semantics, and
perhaps other MPI functionality, depending on the level of control they
seek.  That said, many PETSc users never call MPI directly.

> 2) PETSc codes seem to be slower (or have more overhead)
> than codes written directly in MPI.
> Jed seems to have a different perception of PETSc, though,
> and is more enthusiastic about it.
> 
> Admittedly, I don't have any direct comparison
> (i.e. the same exact code implemented via PETSc and via MPI),
> to support what I said above.

If you do find such a comparison, we'd like to see it.  We expose a
small number of interfaces that are known to perform/scale poorly,
because users who are not concerned about scalability ask for them so
often.  These should be clearly marked; we'll fix the docs if that is
not the case.

Note that PETSc's neighbor updates use persistent nonblocking calls by
default, but you can select alltoallw, one-sided, ready-send,
synchronous sends, and a couple of other options, with and without
packing (the choice is made at runtime).  If you know of a faster way,
we'd like to see it.

Note that a default build is in debugging mode, which activates lots of
integrity checks, checks for memory corruption, etc., and is usually 2
or 3 times slower than a production build (--with-debugging=0).
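
For timing comparisons, the usual pattern is to keep a separate
optimized build alongside the debugging one; roughly, run from the top
of the PETSc source tree (the PETSC_ARCH name here is just a
placeholder, and any other configure options you need go on the same
line):

  # separate optimized build for benchmarking (PETSC_ARCH name is a placeholder)
  ./configure PETSC_ARCH=arch-opt --with-debugging=0
  make PETSC_DIR=$PWD PETSC_ARCH=arch-opt all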

> OTOH, if you have a clean and good serial code already developed,
> I think it won't be a big deal to parallelize it directly
> with MPI, assuming that the core algorithm (your Gauss-Seidel solver)
> fits the remaining code in a well structured way.

This depends a lot on the structure of the serial code.  Bill Gropp had
a great quote in the last rce-cast (starts at 38:30, in response to
Brock Palen's question about what to think about when designing a
parallel program):

  I think the first thing they should keep in mind is to see whether
  they can succeed without using MPI.  After all, one of the things that
  we try to do with MPI is to encourage the development of libraries.
  All too often we see people who are reinventing "PETSc-light" instead
  of just pulling up the library and using it.  MPI enabled an entire
  parallel ecosystem for scientific software and the first thing you
  should do is see if you've already had someone else do the job for
  you.  I think after that, if you actually have to write the code, then
  you have to confront the top-down versus bottom-up.  And the next
  mistake that people make is they write the individual node code and
  then try to figure out how to glue it together to all of the other
  nodes.  And we really feel that for many applications, what you want
  to do is to start by viewing your application as a global application,
  have global data structures, figure out how you decompose it, and then
  the code to coordinate the communication between them will be pretty
  obvious.  And you can tell the difference between how an application
  was built, from whether it was top-down or bottom-up.

[...]

  You want to think about how you decompose your data structures, how
  you think about them globally.  Rather than saying, I need to think
  about everything in parallel, so I'll have all these little patches,
  and I'll compute on them, and then figure out how to stitch them
  together.  If you were building a house, you'd start with a set of
  blueprints that give you a picture of what the whole house looks like.
  You wouldn't start with a bunch of tiles and say, "Well I'll put this
  tile down on the ground, and then I'll find a tile to go next to it."
  But all too many people try to build their parallel programs by
  creating the smallest possible tiles and then trying to have the
  structure of their code emerge from the chaos of all these little
  pieces.  You have to have an organizing principle if you're going to
  survive making your code parallel.
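
To make the "global first" point concrete, here is a minimal MPI sketch
(not from the podcast; the global size N and the contiguous block layout
are just illustrative choices).  It starts from a single global index
range and derives each rank's owned piece from it, instead of starting
from per-process tiles and trying to stitch them together afterwards:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
    int rank, size;
    const int N = 1000;            /* global problem size, illustrative */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Decompose the global range [0,N) into nearly equal contiguous
       blocks; any rank can compute any other rank's block from the
       global description alone. */
    int base   = N / size;
    int extra  = N % size;
    int nlocal = base + (rank < extra ? 1 : 0);
    int start  = rank * base + (rank < extra ? rank : extra);

    printf("rank %d owns global indices [%d, %d)\n",
           rank, start, start + nlocal);

    MPI_Finalize();
    return 0;
  }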

Jed
