I and various colleagues of mine have from time to time encountered systems that got a bit behind on WAL archiving, because the archive_command started failing and nobody noticed right away. Ideally, people should have monitoring for this and put it to rights immediately, but some people don't. If those people happen to have a relatively small pg_wal partition, they will likely become aware of the issue when it fills up and takes down the server, but some users provision disk space pretty generously and therefore nothing compels them to notice the issue until they fill it up. In at least one case, on a system that was actually generating a reasonable amount of WAL, this took in excess of six months.

As you might imagine, pg_wal can get fairly large in such scenarios, but the user is generally less concerned with solving that problem than they are with getting the system back up. It is doubtless true that the user would prefer to shrink the disk usage down to something more reasonable over time, but on the facts as presented, it can't really be an urgent issue for them. What they really need is just to free up a little disk space somehow or other and then get archiving running fast enough to keep up with future WAL generation. Regrettably, the archiver cannot do this, not even if you set archive_command = /bin/true, because the archiver will barely ever actually run the archive_command. Instead, it will spend virtually all of its time calling readdir(), because for some reason it feels a need to make a complete scan of the archive_status directory before archiving a WAL file, and then it has to make another scan before archiving the next one.

Someone - and it's probably for the best that the identity of that person remains unknown to me - came up with a clever solution to this problem, which is now used almost as a matter of routine whenever this comes up. You just run pg_archivecleanup on your pg_wal directory, and then remove all the corresponding .ready files and call it a day. I haven't scrutinized the code for pg_archivecleanup, but evidently it avoids needing O(n^2) time for this and therefore can clean up the whole directory in something like the amount of time the archiver would take to deal with a single file.

While this seems to be quite an effective procedure and I have not yet heard any user complaints, it seems disturbingly error-prone, and honestly shouldn't ever be necessary. The issue here is only that pgarch.c acts as though after archiving 000000010000000000000001, 000000010000000000000002, and then 000000010000000000000003, we have no idea what file we might need to archive next. Could it, perhaps, be 000000010000000000000004?
Only a full directory scan will tell us the answer!

I have two possible ideas for addressing this; perhaps other people will have further suggestions.

A relatively non-invasive fix would be to teach pgarch.c how to increment a WAL file name. After archiving segment N, check using stat() whether there's a .ready file for segment N+1. If so, do that one next. If not, then fall back to performing a full directory scan. As far as I can see, this is just cheap insurance. If archiving is keeping up, the extra stat() won't matter much. If it's not, this will save more system calls than it costs. Since during normal operation it shouldn't really be possible for files to show up in pg_wal out of order, I don't really see a scenario where this changes the behavior, either. If there are gaps in the sequence at startup time, this will cope with it exactly the same as we do now, except with a better chance of finishing before I retire.

However, that's still pretty wasteful. Every time we have to wait for the next file to be ready for archiving, we'll basically fall back to repeatedly scanning the whole directory, waiting for it to show up. And I think that we can't get around that by just using stat() to look for the appearance of the file we expect to see, because it's possible that we might be doing all of this on a standby which then gets promoted, or some upstream primary gets promoted, and WAL files start appearing on a different timeline, making our prediction of what the next filename will be incorrect. But perhaps we could work around this by allowing pgarch.c to access shared memory, in which case it could examine the current timeline whenever it wants, and probably also whatever LSNs it needs to know what's safe to archive. If we did that, could we just get rid of the .ready and .done files altogether? Are they just a really expensive IPC mechanism to avoid a shared memory connection, or is there some more fundamental reason why we need them?
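
To make the first idea (increment the name, stat() the .ready file, otherwise fall back to a scan) a bit more concrete, here is a rough standalone sketch. It is not taken from pgarch.c. The function names next_wal_name() and next_ready_exists() are made up for illustration, and it assumes the default 16MB segment size, which gives 256 segments per 4GB "log" id. A real patch would presumably use the server's configured segment size rather than a hard-coded constant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

#define WAL_NAME_LEN    24      /* 8 hex digits each: timeline, log, seg */
#define SEGS_PER_XLOGID 256u    /* assumption: 4GB / 16MB segments */

/*
 * Hypothetical helper: given a WAL segment file name, write the name of
 * the following segment on the same timeline into "out".  Returns false
 * if "cur" is not a plain 24-character WAL segment name.
 */
bool
next_wal_name(const char *cur, char *out, size_t outlen)
{
    unsigned int tli, log, seg;
    uint64_t    segno;

    assert(outlen > WAL_NAME_LEN);

    /* Reject .history, .backup, .partial and anything else unusual. */
    if (strlen(cur) != WAL_NAME_LEN ||
        strspn(cur, "0123456789ABCDEF") != WAL_NAME_LEN)
        return false;
    if (sscanf(cur, "%8X%8X%8X", &tli, &log, &seg) != 3 ||
        seg >= SEGS_PER_XLOGID)
        return false;

    /* Flatten (log, seg) into one counter, add one, and split it again. */
    segno = (uint64_t) log * SEGS_PER_XLOGID + seg + 1;
    if (segno / SEGS_PER_XLOGID > UINT32_MAX)
        return false;

    snprintf(out, outlen, "%08X%08X%08X", tli,
             (unsigned int) (segno / SEGS_PER_XLOGID),
             (unsigned int) (segno % SEGS_PER_XLOGID));
    return true;
}

/*
 * Hypothetical helper: after archiving "cur", check with a single stat()
 * whether <status_dir>/<next>.ready exists.  On true, "next" holds the
 * segment to archive next; on false, the caller should fall back to the
 * existing full directory scan.
 */
bool
next_ready_exists(const char *status_dir, const char *cur,
                  char *next, size_t nextlen)
{
    char        path[1024];
    struct stat st;

    if (!next_wal_name(cur, next, nextlen))
        return false;
    if (snprintf(path, sizeof(path), "%s/%s.ready", status_dir, next)
        >= (int) sizeof(path))
        return false;
    return stat(path, &st) == 0;
}
```

Note that this only guesses the next segment on the same timeline, so it does nothing for the timeline-switch case described above; that is exactly where the fallback scan (or shared memory access) is still needed.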

And is there any good reason why the archiver shouldn't be connected to shared memory? It is certainly nice to avoid having more processes connected to shared memory than necessary, but the current scheme is so inefficient that I think we end up worse off.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com