On Wed, May 25, 2011 at 10:13 AM, Bruce Momjian <br...@momjian.us> wrote:
>> > Idempotent does seem like the most promising idea.
>>
>> I tend to agree with you, but I'm worried it won't actually work out
>> to a win. By the time we augment the records with enough additional
>> information we may have eaten up a lot of the benefit we were hoping
>> to get.
>
> This is where I was confused. Our bad case now is when someone modifies
> one row on a page between checkpoints --- instead of writing 400 bytes,
> we write 8400. What portion of between-checkpoint activity writes more
> than a few rows to a page? I didn't think many, except for COPY.
> Ideally we could switch in and out of this mode per page, but that seems
> super-complicated.
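
Just to spell out where the 8K comes from, for anyone following along:
the first WAL-logged change to a page after a checkpoint has to carry a
full image of the page (so that a torn write can be repaired at
recovery), and that is what turns a ~400-byte record into ~8400 bytes
(BLCKSZ plus a bit of header). Very roughly, and with made-up names
rather than the actual xlog.c logic:

#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ 8192              /* default PostgreSQL block size */

typedef uint64_t XLogRecPtr;     /* WAL position, simplified to a plain integer */

/*
 * Illustrative sketch only -- the real test lives in xlog.c and also
 * depends on the full_page_writes setting.
 */
static bool
needs_full_page_image(XLogRecPtr page_lsn, XLogRecPtr checkpoint_redo_ptr)
{
    /*
     * If the page hasn't been WAL-logged since the last checkpoint
     * began, the next record that touches it must include a full copy
     * of the page, not just the few hundred bytes describing the change.
     */
    return page_lsn <= checkpoint_redo_ptr;
}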
Well, an easy-to-understand example would be a page that gets repeated
HOT updates. We'll do this: add a tuple, add a tuple, add a tuple, add
a tuple, HOT cleanup, add a tuple, add a tuple, add a tuple, add a
tuple, HOT cleanup... and so on. In the worst case, that could be done
many, many times between checkpoints that might be up to an hour apart.

The problem can also occur (with a little more difficulty) even without
HOT. Imagine a small table with lots of inserts and deletes. Page fills
up, some rows are deleted, vacuum frees up space, page fills up again,
some more rows are deleted, vacuum frees up space again, and so on.

But you raise an interesting point, which is that it might also be
possible to reduce the impact of write-ahead logging in other ways. For
example, if we're doing a large COPY into a table, we could buffer up a
full block of tuples and then just emit an FPI for the page. This would
likely be cheaper than logging each tuple individually.

In fact, you could imagine keeping a queue of pending WAL for each
block in shared buffers. You don't really need that WAL to be
consolidated into a single stream until either (a) you need to write
the block or (b) you need to commit the transaction. When one of those
things happens, you can decide at that point whether it's cheaper to
emit the individual records or do some sort of consolidation. Doing it
in exactly that way is probably impractical, because every backend that
wants to commit would have to make a sweep of every buffer it's dirtied
and see if any of them still contain WAL that needs to be shoved into
the main queue, and that would probably suck, especially for small
transactions. But maybe there's some variant that could be made to
work; a rough sketch of the kind of thing I have in mind is below.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
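
To make the per-buffer idea concrete, here is a very rough sketch. All
of the names (PendingWalQueue, choose_flush_strategy, and so on) are
invented for illustration; none of this corresponds to actual
PostgreSQL code.

#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192                     /* default PostgreSQL block size */

/* One not-yet-emitted WAL record for a particular buffer. */
typedef struct PendingWalRecord
{
    struct PendingWalRecord *next;
    size_t      rec_len;                /* length of the record payload */
    char        data[64];               /* payload (fixed size for the sketch) */
} PendingWalRecord;

/*
 * Per-buffer queue of pending WAL, imagined to live next to the buffer
 * descriptor in shared memory.
 */
typedef struct PendingWalQueue
{
    PendingWalRecord *head;
    size_t      total_len;              /* sum of rec_len over the queue */
} PendingWalQueue;

/*
 * Instead of pushing each record into the main WAL stream immediately,
 * park it on the buffer's queue.
 */
static void
queue_pending_wal(PendingWalQueue *q, PendingWalRecord *rec)
{
    rec->next = q->head;
    q->head = rec;
    q->total_len += rec->rec_len;
}

/*
 * At flush time -- when the buffer has to be written out, or when a
 * transaction that dirtied it commits -- decide whether it's cheaper to
 * emit the queued records individually or to consolidate them into a
 * single full-page image of the buffer's current contents.
 */
typedef enum
{
    EMIT_INDIVIDUAL_RECORDS,
    EMIT_FULL_PAGE_IMAGE
} WalFlushChoice;

static WalFlushChoice
choose_flush_strategy(const PendingWalQueue *q)
{
    return (q->total_len >= BLCKSZ) ? EMIT_FULL_PAGE_IMAGE
                                    : EMIT_INDIVIDUAL_RECORDS;
}

In the COPY case, the queued records for a freshly filled page would
typically add up to something close to a full page, so the
consolidation path wins and you get roughly one FPI per page instead of
one record per tuple. The hard part, as noted above, is paying for the
flush sweep at commit time without hurting small transactions.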