On Mon, Apr 25, 2011 at 2:21 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
>> Right, but the trick is how you identify which blocks you need to
>> zero. You used the word "damaged", which to me implied that the block
>> had been modified in some way but ended up with other than the
>> expected contents, so that something like a CRC check might detect the
>> problem. My point (as perhaps you already understand) is that you
>> could easily have a situation where every block in the table passes a
>> hypothetical block-level CRC check, but the table as a whole is still
>> damaged because update chains aren't coherent. So you need some kind
>> of mechanism for identifying which portions of the table you need to
>> zero to get back to a guaranteed-coherent state.
>
> That sounds like progress.
>
> The current mechanism is "truncate complete table". There are clearly
> other mechanisms that would not remove all data.

No doubt. Consider a block B. If the system crashes when block B is
dirty either in the OS cache or shared_buffers, then you must zero B,
or truncate it away. If it was clean in both places, however, it's
good data and you can keep it.

So you can imagine, for example, a scheme where the relation is
divided into 8MB chunks, and we WAL-log the first operation after each
checkpoint that touches a chunk. Replay zeroes the chunk, and we also
invalidate all the indexes (the user must REINDEX to get them working
again). I think that would be safe, and certainly the WAL-logging
overhead would be far less than WAL-logging every change, since we'd
need to emit only ~16 bytes of WAL for every 8MB written, rather than
~8MB of WAL for every 8MB written. It wouldn't allow some of the
optimizations that the current unlogged tables can get away with only
because they WAL-log exactly nothing - and selectively zeroing chunks
of a large table might slow down startup quite a bit - but it might
still be useful to someone.

However, I think that the "logged table, unlogged index" idea is
probably the most promising thing to think about doing first. It's
easy to imagine all sorts of uses for that sort of thing even in cases
where people can't afford to have any data get zeroed, and it would
provide a convenient building block for something like the above if we
eventually wanted to go that way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
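
For illustration only, the chunk-level scheme described above could be
modelled roughly as follows. This is a toy Python simulation, not
PostgreSQL code: the class ChunkLoggedRelation, its methods, and the
8kB block size are assumptions made for the sketch, and invalidating
indexes only when at least one chunk is zeroed is a guess at the
intended behaviour.

```python
# Toy model of "WAL-log the first touch of each 8MB chunk after a checkpoint".
# All names here are hypothetical; nothing below is a PostgreSQL API.
CHUNK_SIZE = 8 * 1024 * 1024   # 8MB chunks, as proposed in the message
BLOCK_SIZE = 8192              # assumed block size (PostgreSQL's default)
BLOCKS_PER_CHUNK = CHUNK_SIZE // BLOCK_SIZE


class ChunkLoggedRelation:
    def __init__(self, nblocks):
        self.blocks = [b"data"] * nblocks  # stand-in for on-disk block contents
        self.touched = set()               # chunks touched since last checkpoint
        self.wal = []                      # one small record per first touch
        self.indexes_valid = True

    def write_block(self, blockno, payload):
        chunk = blockno // BLOCKS_PER_CHUNK
        if chunk not in self.touched:
            # First operation on this chunk since the checkpoint: log it once.
            self.touched.add(chunk)
            self.wal.append(chunk)
        self.blocks[blockno] = payload

    def checkpoint(self):
        # Assumes the checkpoint flushed every dirty block, so all chunks
        # are clean again and the old records are no longer needed.
        self.touched.clear()
        self.wal.clear()

    def crash_recover(self):
        # Replay: zero every chunk that might have held a dirty block.
        for chunk in self.wal:
            start = chunk * BLOCKS_PER_CHUNK
            end = min(start + BLOCKS_PER_CHUNK, len(self.blocks))
            for blockno in range(start, end):
                self.blocks[blockno] = None
        if self.wal:
            # Assumption: indexes need a REINDEX only if something was zeroed.
            self.indexes_valid = False
        self.touched.clear()
        self.wal.clear()
```

In this model, repeated writes to the same chunk between checkpoints
add no further records, which is where the saving over logging every
change would come from. Chunks that were never touched after the last
checkpoint survive recovery intact.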