On Sat, 4 Jun 2005, Bruce Momjian wrote: > Date: Sat, 4 Jun 2005 19:33:40 -0400 (EDT) > > I have added a link to this thread on the TODO list under TODO.detail.
Thanks Bruce (and Josh Berkus) for reminding me about this. I should be able to follow this up in early July. -s > Serguei A. Mokhov wrote: > > Hello dear all, > > > > [Please CC your replies to me as I am on the digest mode] > > > > Here's finally a very high-level design proposal of the pg_upgrade feature > > I was handwaiving a couple of weeks ago. Since, I am almost done with the > > moving, I can allocate some time for this for 8.1/8.2. > > > > If this topic is of interest to you, please read on until the very end > > before flaming or bashing the ideas out. I had designed that thing and > > kept updating (the design) more or less regularly, and also reflected some > > issues from the nearby threads [1] and [2]. > > > > This design is very high-level at the moment and is not very detailed. I > > will need to figure out more stuff as I go and design some aspects in > > finer detail. I started to poke around asking for initdb-forcing code > > paths in [3], but got no response so far. But I guess if the general idea > > or, rather, ideas accepted I will insist on more information more > > aggressively :) if I can't figure something out for myself. > > > > [1] http://archives.postgresql.org/pgsql-hackers/2004-09/msg00000.php > > [2] http://archives.postgresql.org/pgsql-hackers/2004-09/msg00382.php > > [3] http://archives.postgresql.org/pgsql-hackers/2004-08/msg01594.php > > > > Comments are very welcome, especially _*CONSTRUCTIVE*_... > > > > Thank you, and now sit back and read... > > > > CONTENTS: > > ========= > > > > 1. The Need > > 1. Utilities and User's View of the pg_upgrade Feature > > 2. Storage Management > > - Storage Managers and the smgr API > > 3. Source Code Maintenance Aspects > > 2. The Upgrade Sequence > > 4. Proposed Implementation Plan > > - initdb() API > > - upgrade API > > > > > > 1. The Need > > ----------- > > > > It's been a problem for PG for quite awhile now to have a less painful > > upgrade procedure with every new revision of PostgreSQL, so the > > dump/restore sequence is required. That can take a while for a production > > DB, while keeping it offline. The new replication-related solutions, such > > as Slony I, pg_pool, and others can remedy the problem somewhat, but > > require to roughly double the storage requirements of a given database > > while replicating from the older server to a newer one. > > > > The proposed implementation of an in-server pg_upgrade facility attempts > > to address both issues at the same time -- a possibility to keep the > > server running and upgrading lazily w/o doubling the storage requirements > > (there will be some extra disk space taken, but far from doubling the > > size). The in-process upgrade will not take much of down time and won't > > require that much memory/disk/network resources as replication solutions > > do. > > > > > > Prerequisites > > ------------- > > > > Ideally, the (maybe not so anymore) ambitious goal is to simply be able to > > "drop in" the new binaries of the new server and kick off on the older > > version of data files. I think is this feasible now a lot more than before > > since we have those things available, which should ease up the > > implementation: > > > > - bgwriter > > - pg_autovacuum (the one to be integrated into the backend in 8.1) > > - smgr API for pluggable storage managers > > - initdb in C > > - ... > > > > initdb in C, bgwriter and pg_autovacuum, and pluggable storage manager > > have made the possibility of creation of the Upgrade Subsystem for > > PostgreSQL to be something more reasonable, complete, feasible, and sane > > to a point. > > > > > > Utilities and the User's (DBA) View of the Feature > > -------------------------------------------------- > > > > Two instances exist: > > > > pg_upgrade (in C) > > > > A standalone utility to upgrade the binary on-disk format from one > > version to another when the database is offline. > > We should always have this as an option. > > pg_upgrade will accept sub/super set of pg_dump(all)/pg_restore > > options that do not require a connection. I haven't > > thought through this in detail yet. > > > > pg_autoupgrade > > > > a postgres subprocess, modeled after bgwriter and pg_autovacuum > > daemons. This will work when the database system is running > > on old data directory, and lazily converting relations to the new > > format. > > > > pg_autoupgrade daemon can be triggered by the following events in addition > > to the lazy upgrade process: > > > > "SQL" level: UPGRADE <ALL | relation_name [, relation_name]> [NOW | time] > > > > While the database won't be offline running over older database files, > > SELECT/read-only queries would be allowed using older storage managers*. > > Any write operation on old data will act using write-invalidate approach > > that will force the upgrade the affected relations to the new format to be > > scheduled after the relation-in-progress. > > > > (* See the "Storage Management" section.) > > > > Availability of the relations while upgrade is in progress is likely to be > > the same as in VACUUM FULL for that relation, i.e. the entire relation is > > locked until the upgrade is complete. Maybe we could optimize that by > > locking only particular pages of relations, I have to figure that out. > > > > The upgrade of indices can be done using REINDEX, which seems far less > > complicated than trying to convert its on-disk representation. This has > > to be done after the relation is converted. Alternatively, the index > > upgrade can simply be done by "CREATE INDEX" after the upgrade of > > relations. > > > > The relations to be upgraded are ordered according to some priority, e.g. > > system relations being first, then user-owned relations. System relations > > upgrade is forced upon the postmaster startup, and then user relations are > > processed lazily. > > > > So, in a sense, pg_autoupgrade will act like a proxy choosing appropriate > > storage manager (like a driver) between the new server and the old data > > file upgrading them on-demand. For that purpose we might need to add a > > pg_upgradeproxy to intercept backend requests and use appropriate storage > > manager. There will be one proxy process per backend. > > > > > > Storage Management > > ================== > > > > Somebody has made a possibility to plug a different storage manager in > > postgres and we even had two of them at some point . for the magnetic disk > > and the main memory. The main memory one is gone, but the smgr API is > > still there. Some were dubious why we would ever need another third-party > > storage manager, but here I propose to "plug in" storage managers from the > > older Postgres versions itself! Here is where the pluggable storage > > manager API would be handy once fully resurrected. Instead of trying to > > plug some third party storage managers it will primarily be used by the > > storage managers of different versions of Postgres. > > > > We can take the storage manager code from the past maintenance releases, > > namely 6.5.3, 7.0.3, 7.1.3, 7.2.5, 7.3.7, 7.4.5, and 8.0, and arrange them > > in appropriate fashion and have them implement the API properly. Anyone > > can contribute a storage manager as they see fit, there's no need to get > > them all at once. As a trial implementation I will try to do the last > > three or four maybe. > > > > > > Where to put relations being upgraded? > > -------------------------------------- > > > > At the beginning of the upgrade process if pg detects the old version of > > data files, it moves them under $PGDATA/<ver>, and keeps the old relations > > there until upgraded. The relations to be upgraded will be kept in the > > pg_upgrade_catalog. Once all relations upgraded, the <ver> directory is > > removed and the auto and proxy processes are shut down. The contents of > > the pg_upgrade_catalog emptied. The only issue remains is how to deal > > with tablespaces (or LOCATION in 7.* releases) elsewhere .- this can > > probably be addressed in the similar fashion, but having a > > /my/tablespace/<ver> directory. > > > > Source Code Maintenance > > ======================= > > > > Now, after the above some of you may get scared on the amount of similar > > code to possibly maintain in all those storage managers, but in reality > > they would require as much maintenance as the corresponding releases do > > get back-patched in that code area, and some are not being maintained for > > quite some time already. Plus, I should be around to maintain it, should > > this become realized. > > > > Release-time Maintenance > > ------------------------ > > > > For maintenance of pg_upgrade itself, one will have to fork out a new > > storage manager from the previous stable release and "register" it within > > the system. Alternatively, the new storage manager can be forked when the > > new release cycle begins. Additionally, a pg_upgrade version has to be > > added implementing the API steps outlined in the pg_upgrade API section. > > > > > > Implementation Steps > > ==================== > > > > To materialize the above idea, I'd proceed as follows: > > > > *) Provide the initdb() API (quick) > > > > *) Resurrect the pluggable storage manager API to be usable for the > > purpose. > > > > *) Document it > > > > *) Implement pg_upgrade API for 8.0 and 7.4.5. > > > > *) Extract 8.0 and 7.4.5 storage managers and have them implement the API > > as a proof of concept. Massage the API as needed. > > > > *) Document the process of adding new storage managers and pg_upgrade > > drivers. > > > > *) Extract other versions storage managers. > > > > > > pg_upgrade sequence > > ------------------- > > > > pg_upgrade API for the steps below to update for the next release. > > > > What to do with WAL?? Maybe upgrade can simply be done using WAL replay > > with old WAL manager? Not, fully, because not everything is in WAL, but > > some WAL recovery maybe needed in case the server was not shutdown cleanly > > before the upgrade. > > > > pg_upgrade will proceed as follows: > > > > - move PGDATA to PGDATA/<major pg version> > > - move tablespaces likewise > > - optional recovery from WAL in case old server was not shutdown properly > > -? Shall I upgrade PITR logs of 8.x??? So one can recover to a > > point-in-time in the upgraded database? > > - CLUSTER all old data > > - ANALYZE all old data > > - initdb() new system catalogs > > - Merge in modifications from old system catalogs > > - upgrade schemas/users > > -- variations > > - upgrade user relations > > > > Upgrade API: > > ------------ > > > > First draft, to be refined multiple times, but to convey the ideas behind: > > > > moveData() > > movePGData() > > moveTablespaces() 8.0+ > > moveDbLocation() < 8.0 > > > > preliminaryRecovery() > > - WAL?? > > - PITR 8.0+?? > > > > preliminaryCleanup() > > CLUSTER -- recover some dead space > > ANALYZE -- gives us stats > > > > upgradeSystemInfo() > > initdb() > > mergeOldCatalogs() > > mergeOldTemplates() > > > > upgradeUsers() > > > > upgradeSchemas() > > - > 7.2, else NULL > > > > upgradeUserRelations() > > upgradeIndices() > > DROP/CREATE > > > > upgradeInit() > > { > > > > } > > > > The main body in pseudocode: > > > > upgradeLoop() > > { > > moveData(); > > preliminaryRecovery(); > > preliminaryCleanup(); > > upgradeSystemInfo(); > > upgradeUsers(); > > upgradeSchemas(); > > upgradeUserRelations(); > > } > > > > Something along these lines the API would be: > > > > typedef struct t_upgrade > > { > > bool (*moveData) (void); > > bool (*preliminaryRecovery) (void); /* may be NULL > > */ > > bool (*preliminaryCleanup) (void); /* may be NULL > > */ > > bool (*upgradeSystemInfo) (void); /* may be NULL > > */ > > bool (*upgradeUsers) (void); /* may be NULL */ > > bool (*upgradeSchemas) (void); /* may be NULL > > */ > > bool (*upgradeUserRelations) (void); /* may be NULL > > */ > > } t_upgrade; > > > > > > The above sequence is executed by either pg_upgrade utility uninterrupted > > or by the pg_autoupgrade daemon. In the former the upgrade priority is > > simply by OID, in the latter also, but can be overridden by the user using > > the UPGRADE command to schedule relations upgrade, write operation can > > also change such schedule, with user's selected choice to be first. The > > more write requests a relation receives while in the upgrade queue, its > > priority increases; thus, the relation with most hits is on top. In case > > of tie, OID is the decision mark. > > > > Some issues to look into: > > > > - catalog merger > > - a crash in the middle of upgrade > > - PITR logs for 8.x+ > > - ... > > > > Flames and Handwaiving > > ---------------------- > > > > Okay, flame is on, but before you flame, mind you, this is a very initial > > version of the design. Some of the ideas may seem far fetched, the > > contents may seem messy, but I believe it's now more doable than ever and > > I am willing to put effort in it for the next release or two and then > > maintain it afterwards. It's not going to be done in one shot maybe, but > > incrementally, using input, feedback, and hints from you, guys. > > > > Thank you for reading till this far :-) I.d like to hear from you if any > > of this made sense to you. > > > > Truly yours, -- Serguei A. Mokhov | /~\ The ASCII Computer Science Department | \ / Ribbon Campaign Concordia University | X Against HTML Montreal, Quebec, Canada | / \ Email! ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly