Robert Haas wrote:
On Mon, Mar 29, 2010 at 10:46 AM, Joachim Wieland <j...@mcknight.de> wrote:
[...]
- Regarding the output of pg_dump I am proposing two solutions. The
first one is to introduce a new archive type "directory" where each
table and each blob is a file in a directory, similar to the
experimental "files" archive type. Also the idea has come up that you
should be able to specify multiple directories in order to make use of
several physical disk drives. Taking this further, in order to
manage all the mess that you can create with this, every file of the
same backup needs to have a unique identifier and pg_restore should
have a check parameter that tells you if your backup directory is in a
sane and complete state (think about moving a file from one backup
directory to another one or trying to restore from two directories
which are from different backup sets...).
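
For illustration, a directory archive along those lines might look roughly like this (the -Fd flag, the file names, and the --check option are hypothetical here, just sketching the proposed interface, not an existing one):

    # hypothetical: dump into a directory, one file per table / blob
    pg_dump -Fd -f /backup/mydb.dir mydb

    ls /backup/mydb.dir
    #   toc.dat           <- table of contents, carrying a unique backup ID
    #   3421.dat          <- one data file per table
    #   3422.dat
    #   blob_16401.dat    <- one file per large object

    # hypothetical sanity check: all files present and from the same backup set
    pg_restore --check /backup/mydb.dir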

I think that specifying several directories is a piece of complexity
that would be best left alone for a first version of this.  But a
single directory with multiple files sounds pretty reasonable.  Of
course we'll also need to support that format in non-parallel mode,
and in pg_restore.
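
For instance (again assuming the hypothetical -Fd format flag; the pg_restore -j and -d options already exist), all three of these would have to work against the same archive:

    pg_dump -Fd -f /backup/mydb.dir mydb          # dump to directory
    pg_restore -d newdb /backup/mydb.dir          # non-parallel restore
    pg_restore -j8 -d newdb /backup/mydb.dir      # parallel restore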

The second solution to the single-file problem is to generate no
output at all, i.e. whatever you export from your source database you
import directly into your target database, which in the end turns out
to be a parallel form of "pg_dump | psql".
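
That is, a parallelized equivalent of the familiar serial pipe (host names here are placeholders):

    pg_dump -h source mydb | psql -h target mydb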

This is a very interesting idea but you might want to get the other
thing merged first, as it's going to present a different set of
issues.

I had some prior discussion with Joachim on that (and I suspect I had some influence in his trying to implement it). The reason why this is really needed is that the current pg_restore -j is actually a net loss (vs. "pg_dump | psql") in a LOT of scenarios that are basically "duplicate this database to that location" (or any migration, really). The example at hand is a 240GB production database with around 850 tables: it takes ~145min to dump that database single-threaded (completely CPU-bound), and simply loading the SQL using psql can restore it in ~150min (again CPU-bound, both for COPY and index creation); -j8 brings the restore down to ~55min.
So if you do the math (and a bit of handwaving):

* using pg_dump | psql you get greatest(145,150) -> 150min
* using pg_dump -Z0 -Fc && pg_restore -j8 you get 145+55 -> 200min
* using a theoretical parallel pg_dump and the existing parallel restore you would get 50 (just a guess for how fast it might be) + 55 -> 105min
* a parallel dump & restore that can pipeline would end up at greatest(50,55) -> 55min

So a parallel dump alone would only give you about a 1.4x speedup in total time (150 -> 105min) for doing a migration/upgrade/dump-to-devbox, despite the fact that it uses 8x the resources. A pipelined solution would result in a ~3x speedup in total time, and you don't even have to think about problems like having enough disk space on the source/destination to hold a full temporary dump (if you don't, you might even have to add some transfer time as well).
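
To make the pipelining idea concrete, here is a rough approximation you can hack together today with per-table pipes. It is purely illustrative: the workers don't share a consistent snapshot, dependencies and non-public schemas are ignored, and the schema (indexes included) is created up front serially, so COPY runs against pre-built indexes, which is exactly why built-in support is needed. Host names are placeholders:

    # schema first, serially
    pg_dump -s -h source mydb | psql -h target mydb

    # then table data, 8 tables at a time, each piped straight across
    psql -At -h source -c "SELECT tablename FROM pg_tables WHERE schemaname = 'public'" mydb \
      | xargs -P8 -I{} sh -c 'pg_dump -a -t "{}" -h source mydb | psql -h target mydb'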

Stefan
