Hi,

Several companies, including EnterpriseDB, NTT, and Postgres Pro, have developed technology that permits a block-level incremental backup to be taken from a PostgreSQL server. I believe the idea in all of those cases is that non-relation files should be backed up in their entirety, but for relation files, only those blocks that have changed need to be backed up. I would like to propose that we should have a solution for this problem in core, rather than leaving it to each individual PostgreSQL company to develop and maintain its own solution. Generally, my idea is:
1. There should be a way to tell pg_basebackup to request from the server only those blocks where LSN >= threshold_value. There are several possible ways for the server to implement this, the simplest of which is to just scan all the blocks and send only the ones that satisfy that criterion (sketched in C after this list). That might sound dumb, but it does still save network bandwidth, and it works even without any prior setup. It will probably be more efficient in many cases to instead scan all the WAL generated since that LSN and extract block references from it, but that is only possible if the server has all of that WAL available or can somehow get it from the archive. We could also, as several people have proposed previously, have some kind of additional relation fork that stores either a single is-modified bit -- which only helps if the reference LSN for the is-modified bit is older than the requested LSN but not too much older -- or the highest LSN for each range of K blocks, or something like that. I am at the moment not too concerned with the exact strategy we use here. I believe we may want to eventually support more than one, since they have different trade-offs.

2. When you use pg_basebackup in this way, each relation file that is not sent in its entirety is replaced by a file with a different name. For example, instead of base/16384/16417, you might get base/16384/partial.16417 or however we decide to name them. Each such file will store near the beginning of the file a list of all the blocks contained in that file, and the blocks themselves will follow at offsets that can be predicted from the metadata at the beginning of the file. The idea is that you shouldn't have to read the whole file to figure out which blocks it contains, and if you know specifically which blocks you want, you should be able to read just those blocks reasonably efficiently (a possible header layout is sketched below). A backup taken in this manner should also probably create some kind of metadata file in the root directory that stops the server from starting and lists other salient details of the backup. In particular, you need the threshold LSN for the backup (i.e. it contains only blocks modified at or after this LSN) and the start LSN for the backup (i.e. the LSN that would have been returned from pg_start_backup).

3. There should be a new tool that knows how to merge a full backup with any number of incremental backups and produce a complete data directory with no remaining partial files. The tool should check that the threshold LSN of each incremental backup is less than or equal to the start LSN of the previous backup; if not, there may be changes that happened in between which would be lost, so combining the backups is unsafe (see the check sketched below). Running this tool can be thought of either as restoring the backup or as producing a new synthetic backup from any number of incremental backups. This would allow for a strategy of unending incremental backups. For instance, on day 1, you take a full backup. On every subsequent day, you take an incremental backup. On day 9, you run pg_combinebackup day1 day2 -o full; rm -rf day1 day2; mv full day2. On each subsequent day you do something similar. Now you can always roll back to any of the last seven days by combining the oldest backup you have (which is always a synthetic full backup) with as many newer incrementals as you want, up to the point where you want to stop.
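To make the simplest strategy from point 1 concrete, here is a rough sketch in C. It is purely illustrative, not server code: BLCKSZ is hard-coded, send_block() stands in for whatever the replication protocol would actually do, and the LSN is read from the front of the page header, where PostgreSQL stores it as a pair of 32-bit values.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLCKSZ 8192

    typedef uint64_t XLogRecPtr;

    /*
     * Read the page LSN from the first 8 bytes of a page.  On disk it is
     * stored as {xlogid, xrecoff}, i.e. high and low 32-bit halves.
     */
    static XLogRecPtr
    page_lsn(const char *page)
    {
        uint32_t xlogid, xrecoff;

        memcpy(&xlogid, page, sizeof(xlogid));
        memcpy(&xrecoff, page + sizeof(xlogid), sizeof(xrecoff));
        return ((XLogRecPtr) xlogid << 32) | xrecoff;
    }

    /*
     * Scan one relation file and hand off every block whose LSN is at or
     * past the client-supplied threshold.
     */
    static void
    send_modified_blocks(FILE *relfile, XLogRecPtr threshold,
                         void (*send_block)(uint32_t blkno, const char *page))
    {
        char     page[BLCKSZ];
        uint32_t blkno = 0;

        while (fread(page, 1, BLCKSZ, relfile) == BLCKSZ)
        {
            if (page_lsn(page) >= threshold)
                send_block(blkno, page);
            blkno++;
        }
    }

The WAL-scanning and block-map strategies mentioned above would just be cheaper ways of producing the same (blkno, page) stream.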
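For the partial files in point 2, one possible layout, with all field names and the magic number invented for illustration, would be a fixed header, then the block-number list, then the blocks themselves:

    #include <stdint.h>

    #define BLCKSZ 8192
    #define PARTIAL_MAGIC 0x50475042    /* made-up file identifier */

    typedef struct PartialFileHeader
    {
        uint32_t magic;          /* identifies a partial relation file */
        uint32_t version;        /* format version */
        uint64_t threshold_lsn;  /* blocks at or after this LSN are present */
        uint32_t nblocks;        /* number of block numbers that follow */
        /* uint32_t blocknos[nblocks] follows the header, then the blocks */
    } PartialFileHeader;

    /* Offset of the i-th stored block, computable from the header alone. */
    static inline uint64_t
    partial_block_offset(const PartialFileHeader *hdr, uint32_t i)
    {
        return sizeof(PartialFileHeader)
            + (uint64_t) hdr->nblocks * sizeof(uint32_t)
            + (uint64_t) i * BLCKSZ;
    }

Keeping the block list sorted would additionally let a reader binary-search for a particular block number.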
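The safety check in point 3 is also simple enough to sketch. Assuming each backup's metadata file yields the two LSNs described above (BackupInfo and check_backup_chain() are invented names), the combine tool would verify, for backups ordered oldest to newest:

    #include <inttypes.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct BackupInfo
    {
        uint64_t threshold_lsn;  /* 0 for a full backup */
        uint64_t start_lsn;      /* what pg_start_backup() reported */
    } BackupInfo;

    /*
     * Each incremental's threshold LSN must be <= the start LSN of the
     * backup it is applied on top of; otherwise changes made in between
     * could be silently lost.
     */
    static bool
    check_backup_chain(const BackupInfo *backups, int n)
    {
        for (int i = 1; i < n; i++)
        {
            if (backups[i].threshold_lsn > backups[i - 1].start_lsn)
            {
                fprintf(stderr,
                        "backup %d (threshold %" PRIX64 ") does not chain "
                        "to backup %d (start %" PRIX64 ")\n",
                        i, backups[i].threshold_lsn,
                        i - 1, backups[i - 1].start_lsn);
                return false;
            }
        }
        return true;
    }

The merge itself would then take each block from the newest backup that contains it, falling back to the full backup for blocks no incremental touched.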
Other random points:

- If the server has multiple ways of finding blocks with an LSN greater than or equal to the threshold LSN, it could make a cost-based decision between those methods, or it could allow the client to specify the method to be used.

- I imagine that the server would offer this functionality through a new replication command or a syntax extension to an existing command, so it could also be used by tools other than pg_basebackup if they wished.

- Combining backups could also be done destructively rather than, as proposed above, non-destructively, but you have to be careful about what happens in case of a failure.

- The pg_combinebackup tool (or whatever we call it) should probably have an option to exploit hard links to save disk space; this could in particular make construction of a new synthetic full backup much cheaper. However, you'd better be careful not to use this option when actually trying to restore, because if you start the server and run recovery, you don't want to change the copies of those same files that are in your backup directory. I guess the server could be taught to complain about st_nlink > 1, but I'm not sure we want to go there.

- It would also be possible to collapse multiple incremental backups into a single incremental backup, without combining with a full backup. In the worst case, size(i1+i2) = size(i1) + size(i2), but if the same data is modified repeatedly, collapsing backups would save lots of space. This doesn't seem like a must-have for v1, though.

- If you have a SAN and are taking backups using filesystem snapshots, then you don't need this, because your SAN probably already uses copy-on-write magic for those snapshots, and so you are already getting all of the same benefits in terms of saving storage space that you would get from something like this. But not everybody has a SAN.

- I know that there have been several previous efforts in this area, but none of them have gotten to the point of being committed. I intend no disrespect to those efforts. I believe I'm taking a slightly different view of the problem here than what has been done previously, trying to focus on the user experience rather than, e.g., the technology that is used to decide which blocks need to be sent. However, it's possible I've missed a promising patch that takes an approach very similar to what I'm outlining here, and if so, I don't mind a bit having that pointed out to me.

- This is just a design proposal at this point; there is no code. If this proposal, or some modified version of it, seems likely to be acceptable, I and/or my colleagues might try to implement it.

- It would also be nice to support *parallel* backup, both for full backups as we can do them today and for incremental backups. But that sounds like a separate effort. pg_combinebackup could potentially support parallel operation as well, although that might be too ambitious for v1.

- It would also be nice if pg_basebackup could write backups to places other than the local disk, like an object store, a tape drive, etc. But that also sounds like a separate effort.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company