On 05/25/07 23:02, Chris Browne wrote:
> [EMAIL PROTECTED] (Alvaro Herrera) writes:
>> Erik Jones wrote:
>>> And, to finish up, is there any reason that pg_restore couldn't
>>> already work with separate processes working in parallel?
>>
>> The problem is that the ordering of objects in the dump is the only
>> thing that makes the dump consistent with regard to the dependencies
>> of the objects.  So pg_restore cannot make any assumptions about the
>> parallelisability of restoring the objects in the dump.
>>
>> pg_dump is the only one that has the dependency information.
>>
>> If that information were saved in the dump, then maybe pg_restore
>> could work in parallel.  But it seems a fairly non-trivial thing to do.
>>
>> Mind you, while I am idling at this idea, it seems that just having
>> multiple processes generate a dump is not such a hot idea by itself,
>> because you then have no clue how to order the restoration of the
>> multiple files that are going to result.
>
> I think it's less bad than you think.
>
> The really time-consuming bits of "pg_restore" are:
>
> 1. loading the table data
> 2. creating indices on those tables
> [distant] 3. setting up R/I constraints
>
> If you look at the present structure of pg_dump output, those are all
> pretty visibly separate steps.
>
> pg_dump output [loosely] consists of:
> - type definitions & such
> - table definitions
> - table data loading (i.e. 1)
> - stored function definitions
> - indices (i.e. part of 2)
> - primary keys (i.e. the rest of 2)
> - triggers + rules (including 3)
>
> Thus, a "parallel load" would start by doing some things in a serial
> fashion, namely creating the types and tables.  That isn't a
> parallelizable step, but so what?  It shouldn't take very long.
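For what it's worth, you can already fake that ordering by hand with a
custom-format dump and pg_restore's -l/-L list files.  Here's a rough,
untested sketch (the dump file "dump.fc", the database "mydb", and the
grep patterns are only illustrative, and splitting the data entries by
line count is a naive way to balance the work):

  pg_restore -l dump.fc > full.list

  # carve the TOC into three phases
  grep 'TABLE DATA' full.list > data.list
  grep -E ' INDEX | CONSTRAINT | TRIGGER ' full.list > post.list
  grep -vE 'TABLE DATA| INDEX | CONSTRAINT | TRIGGER ' full.list > pre.list

  # serial step: types, tables, functions, etc.
  createdb mydb
  pg_restore -L pre.list -d mydb dump.fc

  # parallel step: one pg_restore per chunk of the data entries
  split -l 20 data.list data.part.
  for f in data.part.*; do
      pg_restore -L "$f" -d mydb dump.fc &
  done
  wait

  # finish up: indices, primary keys, triggers (this phase could be
  # chunked and run in parallel the same way as the data step)
  pg_restore -L post.list -d mydb dump.fc

Of course, all the dependency bookkeeping is still on whoever writes the
grep patterns, which is exactly the part pg_restore would have to learn
to do for itself.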
All of which would be sped up further by having pg_dump create multiple
output files in the first place.  Of course, as I see it, that is only of
real benefit when you are using tablespaces spread across multiple RAID
devices on a SAN, or multiple SCSI cards.  But then, organizations with
lots of data usually have that kind of h/w.

> The parallel load can load as many tables concurrently as you choose;
> since there are no indices or R/I triggers, those are immaterial
> factors.
>
> Generating indices and primary keys could, again, be parallelized
> pretty heavily, and have (potentially) heavy benefit.
>
> Furthermore, an interesting thing to do might be to use the same
> approach that Slony-I currently uses for subscriptions.  It
> temporarily deactivates triggers and indices while loading the data,
> then reactivates them and requests a re-index.  That would permit
> loading the *entire* schema, save for the data, and then loading and
> indexing with pretty much the maximum possible efficiency.
>
> That seems like a not-completely-frightening "SMOP" (simple matter of
> programming).  Not completely trivial, but not frighteningly
> non-trivial...

(A rough sketch of that deactivate-load-reindex sequence is in the P.S.
below.)

pg_dump would have to be smart enough to split the data rationally across
N output files, though, and that would get tricky (impossible?) if most
of your data is in one *huge* unpartitioned table in a single tablespace.
Que sera.

--
Ron Johnson, Jr.
Jefferson LA USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!
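P.S.  In case it helps to picture the Slony-I-style trick Chris mentions,
here is a minimal, untested sketch of the per-table sequence, using
made-up names (database "mydb", table "orders", a single index, and a
data file /tmp/orders.dat), with plain DROP/CREATE INDEX standing in for
whatever Slony-I actually does to the catalogs.  Note that DISABLE
TRIGGER ALL needs superuser rights, and that re-enabling the FK triggers
afterwards does not re-check rows loaded while they were off:

  # get the triggers (including FK enforcement) and the index out of the way
  psql -d mydb -c "ALTER TABLE orders DISABLE TRIGGER ALL"
  psql -d mydb -c "DROP INDEX orders_customer_idx"

  # bulk-load with no per-row trigger or index maintenance
  psql -d mydb -c "\copy orders from '/tmp/orders.dat'"

  # put it all back; the index builds are the part that could usefully
  # run in parallel across many tables
  psql -d mydb -c "CREATE INDEX orders_customer_idx ON orders (customer_id)"
  psql -d mydb -c "ALTER TABLE orders ENABLE TRIGGER ALL"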