Robert Haas wrote:
On Mon, Mar 29, 2010 at 10:46 AM, Joachim Wieland <j...@mcknight.de> wrote:
[...]
- Regarding the output of pg_dump I am proposing two solutions. The
first one is to introduce a new archive type "directory" where each
table and each blob is a file in a directory, similar to the
experimental "files" archive type. Also the idea has come up that you
should be able to specify multiple directories in order to make use of
several physical disk drives. Taking this further, in order to
manage all the mess that you can create with this, every file of the
same backup needs to have a unique identifier and pg_restore should
have a check parameter that tells you if your backup directory is in a
sane and complete state (think about moving a file from one backup
directory to another one or trying to restore from two directories
which are from different backup sets...).
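
For illustration, a directory archive along those lines might look roughly like this (the -Fd flag, the file names, and the --check option are hypothetical here, just sketching the proposed interface, not an existing one):

    # hypothetical: dump into a directory, one file per table / blob
    pg_dump -Fd -f /backup/mydb.dir mydb

    ls /backup/mydb.dir
    #   toc.dat           <- table of contents, carrying a unique backup ID
    #   3421.dat          <- one data file per table
    #   3422.dat
    #   blob_16401.dat    <- one file per large object

    # hypothetical sanity check: all files present and from the same backup set
    pg_restore --check /backup/mydb.dir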

I think that specifying several directories is a piece of complexity
that would be best left alone for a first version of this.  But a
single directory with multiple files sounds pretty reasonable.  Of
course we'll also need to support that format in non-parallel mode,
and in pg_restore.
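
For instance (again assuming the hypothetical -Fd format flag; the pg_restore -j and -d options already exist), all three of these would have to work against the same archive:

    pg_dump -Fd -f /backup/mydb.dir mydb          # dump to directory
    pg_restore -d newdb /backup/mydb.dir          # non-parallel restore
    pg_restore -j8 -d newdb /backup/mydb.dir      # parallel restore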

The second solution to the single-file problem is to generate no
output at all, i.e. whatever you export from your source database you
import directly into your target database, which in the end turns out
to be a parallel form of "pg_dump | psql".
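
That is, a parallelized equivalent of the familiar serial pipe (host names here are placeholders):

    pg_dump -h source mydb | psql -h target mydb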

This is a very interesting idea but you might want to get the other
thing merged first, as it's going to present a different set of
issues.

I had some prior discussion with Joachim on that (and I suspect I had some influence in his trying to implement it). The reason why this is really needed is that the current pg_restore -j is actually a net loss (vs. "pg_dump | psql") in a LOT of scenarios that are basically "duplicate this database to that location" (or any migration, really). The example at hand is a 240GB production database with around 850 tables: it takes ~145min to dump that database single-threaded (completely CPU-bound), and simply loading the SQL using psql can restore it in ~150min (again CPU-bound, both for COPY and index creation); -j8 brings the restore down to ~55min.
So if you do the math (and a bit of handwaving):

* using pg_dump | psql you get greatest(145,150) -> 150min
* using pg_dump -Z0 -Fc && pg_restore -j8 you get 145+55 -> 200min
* using a theoretical parallel pg_dump and the existing parallel restore you would get 50 (just a guess for how fast it might be) + 55 -> 105min
* a parallel dump & restore that can pipeline would end up at greatest(50,55) -> 55min

So a parallel dump alone would only give you about a 1.4x speedup in total time (150 -> 105min) for doing a migration/upgrade/dump-to-devbox, despite the fact that it uses 8x the resources. A pipelined solution would result in a ~3x speedup in total time, and you don't even have to think about problems like having enough disk space on the source/destination to hold a full temporary dump (if you don't, you might even have to add some transfer time as well).
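
To make the pipelining idea concrete, here is a rough approximation you can hack together today with per-table pipes. It is purely illustrative: the workers don't share a consistent snapshot, dependencies and non-public schemas are ignored, and the schema (indexes included) is created up front serially, so COPY runs against pre-built indexes, which is exactly why built-in support is needed. Host names are placeholders:

    # schema first, serially
    pg_dump -s -h source mydb | psql -h target mydb

    # then table data, 8 tables at a time, each piped straight across
    psql -At -h source -c "SELECT tablename FROM pg_tables WHERE schemaname = 'public'" mydb \
      | xargs -P8 -I{} sh -c 'pg_dump -a -t "{}" -h source mydb | psql -h target mydb'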

Stefan
