Over at https://www.postgresql.org/message-id/CA%2BTgmobFVe4J4AA7z9OMUzKnm09Tt%2BsybhxeL_Ddst3q3wqpzQ%40mail.gmail.com I mentioned parsing the WAL to extract block references so that incremental backup could efficiently determine which blocks needed to be copied. Ashwin replied in http://postgr.es/m/CALfoeitO-vkfjubMFQRmgyXghL-uUnZLNxbr=obrqqsm8kf...@mail.gmail.com to mention that the same approach could be useful for pg_rewind:
> Currently, pg_rewind requires all the WAL logs to be present on source
> side from point of divergence to rewind. Instead just parse the WAL and
> keep the changed blocks around on source. Then don't need to retain the
> WAL but can still rewind using the changed block map.

Since there are at least two possible use cases for an efficient way to learn when blocks last changed, and since in each case the performance benefits of being able to learn that are potentially quite large, I'm starting this thread to brainstorm how such a system might work.

It seems to me that there are basically two ways of storing this kind of information, plus a bunch of variants. One way is to store files that each cover a range of LSNs and basically contain a synopsis of the WAL for those LSNs: you omit all the actual data and just record which blocks were changed by some record in that part of the WAL. In this type of scheme, the storage required is roughly proportional to the volume of WAL for which you wish to retain data. Pruning old data is easy; just remove the files that provide information about LSNs you no longer care about. The other way is to store data about each block, or each range of blocks, or all the blocks that hash onto a certain slot; for each, store the newest LSN that has modified that block, a block in that range, or a block that hashes onto that slot. In this system, storage is roughly proportional to the size of the database cluster, except maybe in the hashing case, but I *think* that case will degrade unless you basically expand the map to be roughly proportional to the size of the cluster anyway. I may be wrong.

Of these two variants, I am inclined to prefer the version where each file is a summary of the block references within some range of LSNs. It seems simpler to implement to me: you just read a bunch of WAL files, and when you get tired you stop and emit an output file. You do need to protect yourself against untimely crashes. One way is to stick a checksum into the output file; after you finish writing it, fsync() it before you start writing the next one, and after a restart, read the latest such file and check whether the checksum is OK. If not, regenerate it; if so, assume it's good and move on. Files other than the last one can be assumed good. Another way is to create the file with a temporary name, fsync() it, and then rename it into place and fsync() again. The background worker that generates the files can have a GUC to remove them when they are older than some threshold amount of time, or you can keep them forever and let the user manually remove the ones they no longer want based on LSN. That's pretty much it.
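To make that second crash-protection recipe concrete, here is a minimal standalone sketch of the write-under-a-temporary-name-then-rename-into-place pattern in plain POSIX C. The file name, the ".blockrefs" suffix, and the payload are made up for illustration; a real implementation would live in the backend and go through its existing file-access machinery rather than raw system calls.

/*
 * Sketch: write a WAL-summary ("block reference") file under a temporary
 * name, fsync() it, rename() it into place, and fsync() the directory so
 * the rename itself survives a crash.  All names here are invented for
 * illustration.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int
write_summary_file(const char *dir, const char *finalname,
                   const void *data, size_t len)
{
    char        tmppath[1024];
    char        finalpath[1024];
    int         fd;
    int         dirfd;

    snprintf(tmppath, sizeof(tmppath), "%s/%s.tmp", dir, finalname);
    snprintf(finalpath, sizeof(finalpath), "%s/%s", dir, finalname);

    /* Write the whole file under a temporary name, then fsync() it. */
    fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    /* A real implementation would loop here to handle short writes. */
    if (write(fd, data, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        unlink(tmppath);
        return -1;
    }
    close(fd);

    /* Atomically move it into place... */
    if (rename(tmppath, finalpath) != 0)
    {
        unlink(tmppath);
        return -1;
    }

    /* ...and fsync() the directory so the rename is durable, too. */
    dirfd = open(dir, O_RDONLY);
    if (dirfd < 0)
        return -1;
    if (fsync(dirfd) != 0)
    {
        close(dirfd);
        return -1;
    }
    close(dirfd);
    return 0;
}

int
main(void)
{
    const char *payload = "pretend this is a serialized list of block references";

    /* The WAL-segment-style name and ".blockrefs" suffix are made up. */
    if (write_summary_file(".", "000000010000000000000010.blockrefs",
                           payload, strlen(payload)) != 0)
    {
        perror("write_summary_file");
        return 1;
    }
    return 0;
}

The final fsync() on the directory is what makes the rename itself durable; without it, a crash at the wrong moment could leave the new directory entry unrecorded even though the file's contents are safely on disk.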
The version where you keep an LSN per block/range/hash bucket seems more complex in terms of durability. Now you have to worry not only about creating new files but also about modifying them, which adds some complexity. I think it may be possible to make it work without doing write-ahead logging for every change, but it certainly needs careful thought to avoid torn-page problems and checkpoint synchronization issues. Moreover, it potentially uses lots and lots of inodes if there are many relations in the cluster. You can avoid that by not creating maps for small files, if you like, or by switching to the hash bucket approach, but either of those choices is lossy. Omitting the maps for small files means you always have to assume everything in those files changed, and the hash bucket approach is vulnerable to hash collisions: you have to assume that all blocks that hash to a given bucket have changed. Those are probably manageable disadvantages, but I think they are real ones.

There is one thing that does worry me about the file-per-LSN-range approach, and that is memory consumption when trying to consume the information. Suppose you have a really high-velocity system. I don't know exactly what the busiest systems around are doing in terms of data churn these days, but let's say just for kicks that we are dirtying 100GB/hour. That's roughly 12.5 million block references per hour. If each block reference takes 12 bytes, that's maybe 150MB/hour in block reference files. If you run a daily incremental backup, you've got to load all the block references for the last 24 hours and deduplicate them, which means you're going to need about 3.6GB of memory. If you run a weekly incremental backup, you're going to need about 25GB of memory. That is not ideal.

One can keep the memory consumption to a more reasonable level by using temporary files. For instance, say you realize you're going to need 25GB of memory to store all the block references you have, but you're only allowed to use 1GB. Well, just hash-partition the data 32 ways by dboid/tsoid/relfilenode/segno, writing each batch to a separate temporary file, and then process each of those 32 files separately. That does add some additional I/O, but it's not crazily complicated and doesn't seem too terrible, at least to me. Still, it's something not to like.
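Here is a minimal standalone sketch of that partitioning step in C. The BlockRef layout, the FNV-style hash, the partition count, and the file names are all assumptions invented for the example, and the input references are synthetic rather than pulled from real WAL; it's only meant to show the shape of the technique.

/*
 * Sketch: hash-partition block references 32 ways by the identity of the
 * relation segment they belong to, spilling each partition to its own
 * temporary file so it can later be sorted and deduplicated within a
 * small memory budget.
 */
#include <stdint.h>
#include <stdio.h>

#define NUM_PARTITIONS 32

typedef struct BlockRef
{
    uint32_t    dboid;          /* database OID */
    uint32_t    tsoid;          /* tablespace OID */
    uint32_t    relfilenode;    /* relation file node */
    uint32_t    segno;          /* 1GB segment number */
    uint32_t    blkno;          /* block number within the segment */
} BlockRef;

/* FNV-1a-style mix over the fields identifying the relation segment. */
static uint32_t
blockref_partition(const BlockRef *ref)
{
    uint32_t    h = 2166136261u;

    h = (h ^ ref->dboid) * 16777619u;
    h = (h ^ ref->tsoid) * 16777619u;
    h = (h ^ ref->relfilenode) * 16777619u;
    h = (h ^ ref->segno) * 16777619u;
    return h % NUM_PARTITIONS;
}

int
main(void)
{
    FILE       *part[NUM_PARTITIONS];
    char        name[64];
    int         i;

    /* One temporary spill file per partition. */
    for (i = 0; i < NUM_PARTITIONS; i++)
    {
        snprintf(name, sizeof(name), "blockrefs.part%02d.tmp", i);
        part[i] = fopen(name, "wb");
        if (part[i] == NULL)
        {
            perror("fopen");
            return 1;
        }
    }

    /* Stand-in for block references extracted from a range of WAL. */
    for (i = 0; i < 1000; i++)
    {
        BlockRef    ref = {1, 1663, 16384 + (uint32_t) (i % 10), 0, (uint32_t) i};

        fwrite(&ref, sizeof(ref), 1, part[blockref_partition(&ref)]);
    }

    /*
     * Each spill file now holds only the references for the relations that
     * hash to it, so it can be loaded, sorted, and deduplicated on its own.
     */
    for (i = 0; i < NUM_PARTITIONS; i++)
        fclose(part[i]);
    return 0;
}

Because the partition is chosen from the relation identity rather than from the block number, all references to the same relation segment land in the same spill file, so each partition can be deduplicated on its own without ever looking at the others.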
Another problem to think about is whether the most recent data is going to be available when you need it. This concern applies to either approach. In the case of incremental backup, the server is going to be up and running, so if the WAL scanner gets behind, you can just wait for it to catch up. In the case of pg_rewind, the server is going to be down, so that doesn't work. You probably need to figure out how new the data you have is, and then scan all the newer WAL to pick up any additional block references. That's a bit of a pain, but I don't see any real alternative. In the case of the file-per-LSN-range approach, it's easy to see which LSNs are covered by the files. In the case of the LSN-per-block/range/hash-bucket approach, you can presumably rely on the last checkpoint having flushed all the pending changes to disk, but the WAL scanner could've been arbitrarily far behind at that point, so you'd have to store a piece of metadata someplace that tells you exactly how far the WAL scanner had progressed and then have pg_rewind fish it out.

Yet another question is how to make sure WAL doesn't get removed before we finish scanning it. Peter mentioned on the other thread that we could use a variant of a replication slot, which immediately made me wonder why we'd need a variant. Actually, the biggest problem I see here is that if we use a replication slot, somebody might try to drop it or use it for some other purpose, and that would mess things up. I guess we could pull the usual trick of reserving names that start with 'pg_' for internal purposes. Or we could just hard-code the LSN that was last scanned for this purpose as a bespoke constraint on WAL discard. Not sure what is best.

I think all of this should be optional functionality. It's going to be really useful for people with large databases, I think, but people with small databases may not care, and it won't be entirely free. If it's not enabled, then the functionality that would otherwise exploit it can fall back to doing things in a less efficient way; nothing needs to break hard.

Opinions?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company