Many applications need to archive data after it has been through the initial flurry of reads and updates that follows its original insertion. Currently there is no specific feature support to meet this requirement, so I propose to add this for 8.4.

Use Case: VLDB with tons of (now) read-only data, some not. The data needs to remain accessible, but is rarely touched, allowing storage costs to be minimised via a "storage hierarchy" of progressively cheaper storage.

Features
- Read-Only Tables
- Compressed Tablespaces
- Attaching tables
- Per-Tablespace Access Costing
- Performance Tuning

Read-Only Tables
----------------

Postgres supports the concept of freezing tuples, so they can live forever within the database without needing further writes. Currently there is no command that will guarantee that a table has been completely frozen. This makes it difficult to reliably write data files to WORM media for longer-term archiving. (WORM means Write-Once, Read-Many.) It's also a pain having to VACUUM a large table again just because a small number of rows need to be frozen.

So we need a DDL command that will ensure all tuples are frozen and then mark the table as read-only. Ideally, we would like to do this in a way that doesn't hold long full-table locks, since we want the data to remain accessible at all times. So...

    VACUUM FREEZE table SET READ ONLY;

would be my first thought, but I'm guessing everybody will press me towards supporting the more obvious

    ALTER TABLE table SET READ ONLY;

This command will place a ShareLock (only) on the table, preventing anybody from writing to the table while we freeze it. The ShareLock is incompatible with any transaction that has written to the table, so when we acquire the lock all writers to the table will have completed. We then run the equivalent of a VACUUM FREEZE, which will then be able to freeze *all* rows in one pass (rather than all except the most recent). On completion of the freeze pass we will update the pg_class entry to show that the table is now read-only, emulating the way VACUUM does this.

This form of the ALTER TABLE command will need to be mangled so it can only run outside of a transaction block, and also so it takes only a ShareLock rather than an AccessExclusiveLock.

Reversing the process is simpler, since we only have to turn off the flag in pg_class:

    ALTER TABLE table SET READ WRITE;

possibly without grabbing an AccessExclusiveLock, though that isn't an important part of this implementation.

Read-only tables never need VACUUMing, so we would be able to make autovacuum and explicit vacuum ignore them.

Read-only tables may not be written to, yet would still allow implicit or explicit INSERT, UPDATE and DELETE privileges to be held on the table. Attempts to write to the table will result in a specific "read only table cannot be modified" ERROR. This allows a table to be placed into read-only mode for long periods of time and flipped back to read-write if some updates are required; that is useful for various performance reasons, see later. We can't use the privilege mechanism to prevent writes, since superusers bypass it. (Thoughts?)
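To make the sequence above concrete, here is a rough backend-style sketch. It is not runnable standalone: freeze_all_tuples(), set_pg_class_readonly() and the pg_class flag relreadonly are hypothetical names for the pieces described, while heap_open(), ereport() and friends are the existing backend idioms.

    /*
     * Sketch only.  "relreadonly" is the proposed new pg_class flag;
     * the two helper functions do not exist today.
     */
    void
    AlterTableSetReadOnly(Oid relid)
    {
        /* must run outside a transaction block, like VACUUM */
        Relation rel = heap_open(relid, ShareLock); /* blocks writers only */

        /*
         * Equivalent of VACUUM FREEZE: with writers excluded we can
         * freeze *all* tuples in a single pass.
         */
        freeze_all_tuples(rel);                 /* hypothetical */

        /* flip the flag in pg_class, non-transactionally, the way
         * VACUUM updates relpages/reltuples */
        set_pg_class_readonly(relid, true);     /* hypothetical */

        heap_close(rel, NoLock);                /* keep lock until commit */
    }

    /* called wherever INSERT/UPDATE/DELETE target relations are opened */
    static void
    CheckRelationNotReadOnly(Relation rel, CmdType operation)
    {
        if ((operation == CMD_INSERT ||
             operation == CMD_UPDATE ||
             operation == CMD_DELETE) &&
            rel->rd_rel->relreadonly)           /* proposed new column */
            ereport(ERROR,
                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                     errmsg("read only table \"%s\" cannot be modified",
                            RelationGetRelationName(rel))));
    }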
Requests for tuple-level SHARE locks (e.g. FOR SHARE) against read-only tables will be ignored, since the rows are effectively already locked. So we don't need to change the internals of the locking, nor edit the RI code to remove the call to SHARE lock referenced tables. Do this during post-parse analysis.

Tables can be copied to WORM media by using

    ALTER TABLE table SET TABLESPACE tblspc;

This would also use a ShareLock rather than an AccessExclusiveLock, piggy-backing off the work mentioned above.

Running SET TABLESPACE and SET READ ONLY at the same time might sound like a good plan, but ISTM it will require two fairly different code paths, so if we do it at all it will be a later addition.

Compressed Tablespaces
----------------------

Frequently with large data archives there is a requirement to reduce the footprint of the data to allow longer-term storage costs to be reduced. For insert-only data we might imagine we can reduce the size of tables by removing unused tuple header information. Although that is possible, repeated headers compress fairly well, so it seems easier to tackle the problem directly by having compressed tables. Using a streaming library like zlib, it will be easy to read/write data files into a still-usable form but with much reduced size.

Access to a compressed table only makes sense as a SeqScan. That would be handled by introducing tablespace-specific access costs, discussed below. Indexes on compressed tables would still be allowed, but would hardly ever be used.

Access would probably be via tablespace-specific storage managers. So implement mdcompress.c alongside md.c in src/backend/storage/smgr. If that implementation route were chosen, it would then allow the compression option to be made at tablespace level, so the command would be:

    CREATE TABLESPACE tablespacename LOCATION 'directory' [COMPRESS];

(ALTER TABLESPACE support is going to be much more complex, so leave that alone for now.)

So when we copy a read-only table to another tablespace, the compression would take place without additional syntax at ALTER TABLE level, i.e. nothing new goes in tablecmds.c. Cool.

mdcompress.c seems fairly straightforward, though we would need to think about how to implement smgr_nblocks(), since lseek-ing to get it won't work: the file size on disk is smaller than that of the actual uncompressed table. Perhaps we need an info file that contains something like an index metapage, from which we can read the number of blocks. Ideas?
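To illustrate, here is a minimal standalone sketch of the streaming idea, assuming an 8K block size; the file names and info-file layout are made up for the demo, and while the zlib calls are real, none of this is smgr code:

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    #define BLCKSZ 8192

    int main(void)
    {
        char     block[BLCKSZ];
        unsigned nblocks = 0;
        gzFile   zf;
        FILE    *info;

        /* write 100 demo blocks through the streaming compressor */
        zf = gzopen("16384.gz", "wb");
        if (!zf)
            return 1;
        for (int i = 0; i < 100; i++)
        {
            memset(block, 'x', sizeof(block));  /* stand-in page contents */
            if (gzwrite(zf, block, BLCKSZ) != BLCKSZ)
                return 1;
            nblocks++;
        }
        gzclose(zf);

        /* keep the logical block count in a tiny uncompressed "info"
         * file, playing the role of an index-style metapage */
        info = fopen("16384.info", "wb");
        fwrite(&nblocks, sizeof(nblocks), 1, info);
        fclose(info);

        /* what smgr_nblocks() would do: read the count, no lseek involved */
        unsigned n = 0;
        info = fopen("16384.info", "rb");
        fread(&n, sizeof(n), 1, info);
        fclose(info);
        printf("nblocks = %u\n", n);

        /* a sequential scan reads the blocks back with gzread() */
        zf = gzopen("16384.gz", "rb");
        while (gzread(zf, block, BLCKSZ) == BLCKSZ)
            ;   /* process page */
        gzclose(zf);
        return 0;
    }

Compiled with -lz, the demo's 100 identical pages shrink from 800KB to around a kilobyte; real heap pages will of course compress far less dramatically.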
In the first pass, I would only allow compressed read-only tables. In time we might allow INSERTs, UPDATEs and DELETEs, though the latter two will never be very effective. So my initial approach does *not* allow writing directly to a compressed table. A couple of reasons why that isn't desirable/easy: if we write data straight to the table, then any aborted loads will be written to the table too, so VACUUMing the table would need to re-compress it, which sounds horrible. Plus, hint status bits would need to be set *after* the data was written. (Perhaps in conjunction with a visibility map we can just forget that aspect.) The other problem is that blocks would need to be written out of shared buffers sequentially, which unfortunately we do not guarantee. We could force that in the smgr layer by keeping track of the last-written block id and then writing all cached blocks up to the currently requested one, but that seems problematic. I'd say if we want more, we do that in the next release, or at least the next phase of development. So we would prevent the direct use of CREATE TABLE on a COMPRESSED tablespace, for now.

I'm open to arguments that we don't need this at all because filesystem utilities exist that do everything we need. Your experience will be good to hear about in regard to this feature.

Attach
------

Writing tables on one system and then moving that data to other systems is fairly common. If we supported read-only tables then you might consider how you would publish new versions to people.

For now, we just want to consider how we will upgrade from one release to another without needing to unload and reload potentially many terabytes of data. We can't delete the old data until the new data is successfully loaded, so we would incur a huge temporary storage cost. This could go very slowly if we use cheaper storage, and reloading the data means we would have to re-freeze it as well. So we need a way of attaching the old tables to the new database. We might call this binary upgrade, or we might be slightly less ambitious and talk about just moving the old read-only data. That's all I want to do at this stage. I'm mentioning this here now to see what comes out in debate, and what others are planning to work on in this area.

Per-Tablespace Access Costs
---------------------------

Access costs are currently set using seq_page_cost and random_page_cost for the whole database. Read-only tables and WORM media would make it more sensible to consider different access costs for each tablespace. This then allows tablespaces on tape, MAID, CD-ROM, or just fast and slow disk, etc. to be catered for. For compressed filesystems, setting random_page_cost = ~100000 should do the trick for most planning issues.

A new parameter, startup_page_cost, might also be introduced, so we can allow for kinds of media that have non-zero initial access times: MAID storage takes ~1 second to reach the first block, whether sequential or random; tape storage can take ~15 seconds for robots, more for humans, even more for outsourced offsite archival companies.
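To make the arithmetic concrete, here is a toy sketch of how a per-tablespace cost structure with a startup term might behave; the struct, helper and numbers are illustrative assumptions, not the planner's actual code, and startup_page_cost does not exist today:

    #include <stdio.h>

    /* hypothetical per-tablespace cost parameters; today seq_page_cost
     * and random_page_cost are database-wide settings */
    typedef struct TablespaceCosts
    {
        double startup_page_cost;   /* cost to reach the first page */
        double seq_page_cost;
        double random_page_cost;
    } TablespaceCosts;

    static double
    seqscan_io_cost(const TablespaceCosts *ts, double pages)
    {
        return ts->startup_page_cost + ts->seq_page_cost * pages;
    }

    int main(void)
    {
        TablespaceCosts disk = { 0.0,    1.0, 4.0 };  /* ordinary disk */
        TablespaceCosts maid = { 1000.0, 1.0, 4.0 };  /* ~1 s to first block */

        /* the same 10000-page scan costs very differently by tablespace */
        printf("disk: %.0f  maid: %.0f\n",
               seqscan_io_cost(&disk, 10000.0),
               seqscan_io_cost(&maid, 10000.0));
        return 0;
    }

The point of the startup term is that the planner could then prefer plans that touch an archival tablespace once, sequentially, over plans that dip into it repeatedly.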
Performance
-----------

There are a number of optimisations that become possible with read-only tables. I'm not suggesting to write all of them as part of this initial project, but it is useful to discuss them at the beginning.

- Removal of empty read-only tables from query plans (particularly interesting for optimising partitioning; nice idea, Greg)
- Column stats can hold min & max values for all columns, allowing min() and max() aggregates to be answered from the catalog, plus those values used for constraint exclusion during planning.
- pg_class.reltuples can be used directly for SELECT count(*).
- Table contents can be cached, allowing RI checks against small read-only tables to operate in a similar manner to CHECK constraints.
- Index-only plans would be able to trust the complete visibility of any index entry, which would speed up RI checks against large read-only tables.

The first two are certainly related to advanced partitioning, so I would be looking to work on those eventually.

Example
--------

When user tables get old, we stamp them read-only and copy them away to a compressed tablespace, which we create here as well to complete the example:

    VACUUM FREEZE mytable SET READ ONLY;

    CREATE TABLESPACE archive001 LOCATION '/vldb_archive/arc001' COMPRESS;

    ALTER TABLE mytable SET TABLESPACE archive001;

Implementation Schedule
-----------------------

Roughly in the order listed above.

I'm being partially sponsored for this work, but not fully. (Don't ask who by; I'm under NDA.) If you're a likely user of these features and would like to sponsor me, please drop me an email.

-- 
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com