Hi,

Apologies if this has already been discussed someplace, but I couldn't find a previous discussion. It seems to me that base backups are broken in the face of a concurrent truncation that reduces the number of segments in a relation.

Suppose we have a relation that is 1.5GB in size, so that we have two files: 23456, which is 1GB, and 23456.1, which is 0.5GB. We'll back those files up in whichever order the directory scan finds them. Suppose we back up 23456.1 first. Then the relation is truncated to 0.5GB, so 23456.1 is removed and 23456 gets a lot shorter. Next, we back up the file 23456. Now our backup contains files 23456 and 23456.1, each 0.5GB. But this breaks the invariant in md.c:

 *  On disk, a relation must consist of consecutively numbered segment
 *  files in the pattern
 *  -- Zero or more full segments of exactly RELSEG_SIZE blocks each
 *  -- Exactly one partial segment of size 0 <= size < RELSEG_SIZE blocks
 *  -- Optionally, any number of inactive segments of size 0 blocks.

basebackup.c's theory about relation truncation is that it doesn't really matter because WAL replay will fix things up. But in this case, I don't think it will, because WAL replay relies on the above invariant holding. As mdnblocks says:

    /*
     * If segment is exactly RELSEG_SIZE, advance to next one.
     */
    segno++;
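To make that rule concrete, here's a minimal standalone sketch, not the actual md.c code: the function count_relation_bytes() and the constant RELSEG_BYTES are made up for illustration, but the stopping rule is the one the invariant licenses: walk segment files 23456, 23456.1, 23456.2, ... and stop at the first one shorter than a full segment.

    #include <stdio.h>
    #include <sys/stat.h>

    #define RELSEG_BYTES ((long long) 1024 * 1024 * 1024)    /* 1GB segments */

    static long long
    count_relation_bytes(const char *relpath)
    {
        char        path[1024];
        struct stat st;
        long long   total = 0;
        int         segno = 0;

        for (;;)
        {
            if (segno == 0)
                snprintf(path, sizeof(path), "%s", relpath);
            else
                snprintf(path, sizeof(path), "%s.%d", relpath, segno);

            if (stat(path, &st) != 0)
                break;          /* no such file: relation ends here */

            total += (long long) st.st_size;

            if ((long long) st.st_size < RELSEG_BYTES)
                break;          /* partial segment: stop scanning, even if
                                 * later segment files still exist on disk */

            segno++;            /* segment is exactly full: advance */
        }
        return total;
    }

    int
    main(void)
    {
        /* With 23456 and 23456.1 both 0.5GB, this reports only 0.5GB. */
        printf("%lld bytes\n", count_relation_bytes("23456"));
        return 0;
    }

Run against the broken backup above, where both 23456 and 23456.1 are 0.5GB, the scan stops at segment 0 and never looks at 23456.1 at all.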
So I think what's going to happen is that we're not going to notice 23456.1 when we recover the backup. It will just sit there as an orphaned file forever, unless we extend 23456 back to a full 1GB, at which point we might abruptly start considering that file part of the relation again.

Assuming I'm not wrong about all of this, the question arises: whose fault is this, and what to do about it? It seems to me that it's a bit hard to blame basebackup.c, because if you used pg_backup_start() and pg_backup_stop() and copied the directory yourself, you'd have exactly the same situation. And while we could (and perhaps should) teach basebackup.c to do something smarter, it doesn't seem realistic to impose complex constraints on the user's choice of file copy tool. Furthermore, I think the problem could arise without performing a backup at all: say the server crashes at the OS level in mid-truncation, and the truncation of segment 0 reaches disk but the removal of segment 1 does not.

So I think the problem is with md.c assuming that its invariant must hold on a cluster that's not guaranteed to be in a consistent state. But mdnblocks() clearly can't try to open every segment up to whatever the maximum theoretically possible segment number is every time it's invoked, because that would be wicked expensive.

An idea that occurs to me is to remove all segment files following the first partial segment during startup, before we begin WAL replay. If that state occurs at startup, then either we have a scenario involving truncation, like those above, or a scenario involving relation extension, where we added a new segment and that made it to disk but the prior extension of the previous last segment file to maximum length did not. But in that case, WAL replay should, I think, fix things up. However, I'm not completely sure that there isn't some hole in this theory, and this way forward also doesn't sound particularly cheap. Nonetheless, I don't have another idea right now.

Thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com