On Tue, Apr 9, 2019 at 5:28 PM Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
> On 2019-Apr-09, Peter Eisentraut wrote:
> > On 2019-04-09 17:48, Robert Haas wrote:
> > > 3. There should be a new tool that knows how to merge a full backup
> > > with any number of incremental backups and produce a complete data
> > > directory with no remaining partial files.
> >
> > Are there by any chance standard file formats and tools that describe
> > a binary difference between directories?  That would be really useful
> > here.
>
> VCDIFF?  https://tools.ietf.org/html/rfc3284
I don't understand VCDIFF very well, but I see some potential problems
with going in this direction.

First, suppose we take a full backup on Monday. Then, on Tuesday, we
want to take an incremental backup. In my proposal, the backup server
only needs to provide the database with one piece of information: the
start-LSN of the previous backup. The server determines which blocks
are recently modified and sends them to the client, which stores them.
The end. On the other hand, storing a maximally compact VCDIFF seems
to require that, for each block modified in the Tuesday backup, we go
read the corresponding block as it existed on Monday. Assuming that
the server is using some efficient method of locating modified blocks,
this will approximately double the amount of read I/O required to
complete the backup: either the server or the client must now read not
only the current version of each block but also the previous version.
If the previous backup is an incremental backup that does not contain
full block images but only VCDIFF content, whoever is performing the
VCDIFF calculation will need to walk the entire backup chain and
reconstruct the previous contents of each modified block so that it
can compute the newest VCDIFF. A customer who does an incremental
backup every day and maintains a synthetic full backup from 1 week
prior will see a roughly eightfold increase in read I/O compared to
the design I proposed.

The same problem exists at restore time. In my design, the total read
I/O required is equal to the size of the database, plus however much
metadata needs to be read from older delta files -- and that should be
fairly small compared to the actual data being read, at least in
normal, non-extreme cases. But if we are going to proceed by applying
a series of delta files, we're going to need to read every older
backup in its entirety. If the turnover percentage is significant, say
20%/day, and if the backup chain is, say, 7 backups long to get back
to a full backup, this is a huge difference. Instead of having to read
~100% of the database size, as in my proposal, we'll need to read
100% + (6 * 20%) = 220% of the database size.

Since VCDIFF uses an add-copy-run language to describe differences, we
could try to work around the problem that I just described by
describing each changed data block as an 8192-byte add, and each
unchanged block as an 8192-byte copy. If we did that, then I think the
problem at backup time goes away: we can write out a VCDIFF-format
file for the changed blocks based just on knowing that those are the
blocks that have changed, without needing to access the older file. Of
course, if we do it this way, the file will be larger than it would be
if we actually compared the old and new block contents and wrote out a
minimal VCDIFF, but it does make taking a backup a lot simpler.

Even with this workaround, though, I think we still have trouble at
restore time. I proposed putting the metadata about which blocks are
included in a delta file at the beginning of the file, which allows a
restore of a new incremental backup to flip through older backups
relatively efficiently to find just the blocks that it needs, without
having to read the whole file. But I think (although I am not quite
sure) that in the VCDIFF format, the payload for an ADD instruction is
stored near the instruction itself. The result would be that you'd
basically have to read the whole file at restore time to figure out
which blocks were available from that file and which ones needed to be
retrieved from an older backup.
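To make that concrete, here is a rough sketch, in C, of the kind of
delta-file layout I have in mind -- the struct and field names below
are invented purely for illustration and are not from any actual
patch:

/*
 * Illustrative only: a delta ("incremental") file that keeps all of
 * its per-block metadata at the front, so a restore can decide what
 * it needs from this file without reading the block payloads.
 */
#include "postgres.h"
#include "access/xlogdefs.h"    /* XLogRecPtr */
#include "storage/block.h"      /* BlockNumber */

typedef struct IncrementalFileHeader
{
    uint32      magic;      /* identifies an incremental file */
    XLogRecPtr  start_lsn;  /* blocks changed since this LSN are included */
    uint32      nblocks;    /* how many block numbers follow */
    /* BlockNumber blocklist[nblocks] follows the fixed header ... */
    /* ... and then nblocks full 8192-byte block images, in list order */
} IncrementalFileHeader;

With something like that, a restore only has to read the header and
the block list from each older file to know whether it needs anything
further from that file. As far as I can tell, a VCDIFF stream gives
you no equivalent shortcut, because the windows, their instructions,
and their ADD payloads all have to be walked together.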
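The inclusion criterion itself is, at heart, just an LSN comparison.
Here is a minimal sketch, again with made-up names; whether the blocks
are actually found by scanning pages like this or by some more
efficient method is a separate question:

/*
 * Minimal sketch, not from any actual patch: a block belongs in the
 * incremental backup if its page LSN shows that it may have been
 * modified since the prior backup's start LSN.
 */
#include "postgres.h"
#include "access/xlogdefs.h"
#include "storage/bufpage.h"    /* Page, PageGetLSN() */

static bool
block_belongs_in_incremental(Page page, XLogRecPtr prior_start_lsn)
{
    return PageGetLSN(page) >= prior_start_lsn;
}

Note that nothing here ties the incremental file to one particular
older backup: it only promises to contain every block whose LSN is >=
the cutoff, a point I come back to below.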
So while the VCDIFF add/copy workaround would fix the backup-time
problem, I believe that it would still require significantly more read
I/O at restore time than my proposal. Furthermore, if, at backup time,
we have to do anything that requires access to the old data, either
the client or the server needs to have access to that data.
Notwithstanding the cost of reading it, that doesn't seem very
desirable. The server is quite unlikely to have access to the backups,
because most users want to back up to a different server in order to
guard against a hardware failure. The client is more likely to be
running on a machine where it has access to the data, because many
users back up to the same machine every day, so the machine that is
taking the current backup probably has the older one. However,
accessing that old backup might not be cheap. It could be located in
an object store in the cloud someplace, or it could have been written
out to a tape drive and the tape removed from the drive. In the design
I'm proposing, that stuff doesn't matter, but if you want to run
diffs, then it does.

Even if the client has efficient access to the data, and even if it
has so much read I/O bandwidth that the cost of reading that old data
to run diffs doesn't matter, it's still pretty awkward for a
tar-format backup. The client would have to take the tar archive sent
by the server apart and form a new one.

Another advantage of storing whole blocks in the incremental backup is
that there's no tight coupling between the full backup and the
incremental backup. Suppose you take a full backup A on S1, and then
another full backup B, and then an incremental backup C based on A,
and then an incremental backup D based on B. If backup B is destroyed
beyond retrieval, you can restore the chain A-C-D and get back to the
same place that restoring B-D would have gotten you. Backup D doesn't
really know or care that it happens to be based on B. It just knows
that it can only give you those blocks that have LSN >= LSN_B; the
rest of the blocks you can get from anywhere that you like. If D
instead stored deltas between the blocks as they exist in backup B,
then those deltas would have to be applied specifically to backup B,
not some possibly-later version.

I think the way to think about this problem, or at least the way I
think about it, is that we need to decide whether we want file-level
incremental backup, block-level incremental backup, or byte-level
incremental backup. pgbackrest implements file-level incremental
backup: if the file has changed, copy the whole thing. That has an
appealing simplicity but risks copying 1GB of data for a 1-byte
change. What I'm proposing here is block-level incremental backup,
which is more complicated and still risks copying 8kB of data for a
1-byte change. Using VCDIFF would, I think, give us byte-level
incremental backup. That would probably do an excellent job of making
incremental backups as small as they can possibly be, because we would
not need to include in the backup image even a single byte of
unmodified data. It also seems to do some other compression tricks
which could shrink incremental backups further. However, my intuition
is that we won't gain enough in terms of backup size to make up for
the downsides listed above.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company