On 05/25/07 23:02, Chris Browne wrote:
> [EMAIL PROTECTED] (Alvaro Herrera) writes:
>> Erik Jones wrote:
>>> And, to finish up, is there any reason that pg_restore couldn't
>>> already work with separate processes working in parallel?
>>
>> The problem is that the ordering of objects in the dump is the only
>> thing that makes the dump consistent with regard to the dependencies
>> of the objects.  So pg_restore cannot make any assumptions about the
>> parallelisability of restoring the objects in the dump.
>>
>> pg_dump is the only one that has the dependency information.
>>
>> If that information were saved in the dump, then maybe pg_restore
>> could work in parallel.  But it seems a fairly non-trivial thing to do.
>>
>> Mind you, while I am idling at this idea, it seems that just having
>> multiple processes generate a dump is not such a hot idea by itself,
>> because you then have no clue how to order the restoration of the
>> multiple files that are going to result.
>
> I think it's less bad than you think.
>
> The really time-consuming bits of "pg_restore" are:
>
> 1. loading the table data
> 2. creating indices on those tables
> [distant] 3. setting up R/I constraints
>
> If you look at the present structure of pg_dump output, those are all
> pretty visibly separate steps.
>
> pg_dump output [loosely] consists of:
> - type definitions & such
> - table definitions
> - table data loading (i.e. 1)
> - stored function definitions
> - indices (i.e. part of 2)
> - primary keys (i.e. the rest of 2)
> - triggers + rules (including 3)
>
> Thus, a "parallel load" would start by doing some things in a serial
> fashion, namely creating the types and tables.  That isn't a
> parallelizable step, but so what?  It shouldn't take very long.
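For what it's worth, you can already fake that ordering by hand with a
custom-format dump and pg_restore's -l/-L list files.  Here's a rough,
untested sketch (the dump file "dump.fc", the database "mydb", and the
grep patterns are only illustrative, and splitting the data entries by
line count is a naive way to balance the work):

  pg_restore -l dump.fc > full.list

  # carve the TOC into three phases
  grep 'TABLE DATA' full.list > data.list
  grep -E ' INDEX | CONSTRAINT | TRIGGER ' full.list > post.list
  grep -vE 'TABLE DATA| INDEX | CONSTRAINT | TRIGGER ' full.list > pre.list

  # serial step: types, tables, functions, etc.
  createdb mydb
  pg_restore -L pre.list -d mydb dump.fc

  # parallel step: one pg_restore per chunk of the data entries
  split -l 20 data.list data.part.
  for f in data.part.*; do
      pg_restore -L "$f" -d mydb dump.fc &
  done
  wait

  # finish up: indices, primary keys, triggers (this phase could be
  # chunked and run in parallel the same way as the data step)
  pg_restore -L post.list -d mydb dump.fc

Of course, all the dependency bookkeeping is still on whoever writes the
grep patterns, which is exactly the part pg_restore would have to learn
to do for itself.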
All of which would be sped up further by having pg_dump create multiple
output files in the first place.  Of course, as I see it, that is only of
real benefit when you are using tablespaces spread across multiple RAID
devices on a SAN, or multiple SCSI cards.  But then, organizations with
lots of data usually have that kind of h/w.

> The parallel load can load as many tables concurrently as you choose;
> since there are no indices or R/I triggers, those are immaterial
> factors.
>
> Generating indices and primary keys could, again, be parallelized
> pretty heavily, and have (potentially) heavy benefit.
>
> Furthermore, an interesting thing to do might be to use the same
> approach that Slony-I currently uses for subscriptions.  It
> temporarily deactivates triggers and indices while loading the data,
> then reactivates them and requests a re-index.  That would permit
> loading the *entire* schema, save for the data, and then loading and
> indexing with pretty much the maximum possible efficiency.
>
> That seems like a not-completely-frightening "SMOP" (simple matter of
> programming).  Not completely trivial, but not frighteningly
> non-trivial...

(A rough sketch of that deactivate-load-reindex sequence is in the P.S.
below.)

pg_dump would have to be smart enough to split the data rationally across
N output files, though, and that would get tricky (impossible?) if most
of your data is in one *huge* unpartitioned table in a single tablespace.
Que sera.

--
Ron Johnson, Jr.
Jefferson LA USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!
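P.S.  In case it helps to picture the Slony-I-style trick Chris mentions,
here is a minimal, untested sketch of the per-table sequence, using
made-up names (database "mydb", table "orders", a single index, and a
data file /tmp/orders.dat), with plain DROP/CREATE INDEX standing in for
whatever Slony-I actually does to the catalogs.  Note that DISABLE
TRIGGER ALL needs superuser rights, and that re-enabling the FK triggers
afterwards does not re-check rows loaded while they were off:

  # get the triggers (including FK enforcement) and the index out of the way
  psql -d mydb -c "ALTER TABLE orders DISABLE TRIGGER ALL"
  psql -d mydb -c "DROP INDEX orders_customer_idx"

  # bulk-load with no per-row trigger or index maintenance
  psql -d mydb -c "\copy orders from '/tmp/orders.dat'"

  # put it all back; the index builds are the part that could usefully
  # run in parallel across many tables
  psql -d mydb -c "CREATE INDEX orders_customer_idx ON orders (customer_id)"
  psql -d mydb -c "ALTER TABLE orders ENABLE TRIGGER ALL"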