I and various colleagues of mine have from time to time encountered
systems that got a bit behind on WAL archiving, because the
archive_command started failing and nobody noticed right away.
Ideally, people should have monitoring for this and put it to rights
immediately, but some people don't. If those people happen to have a
relatively small pg_wal partition, they will likely become aware of
the issue when it fills up and takes down the server, but some users
provision disk space pretty generously and therefore nothing compels
them to notice the issue until they fill it up. In at least one case,
on a system that was actually generating a reasonable amount of WAL,
this took in excess of six months.

As you might imagine, pg_wal can get fairly large in such scenarios,
but the user is generally less concerned with solving that problem
than they are with getting the system back up. It is doubtless true
that the user would prefer to shrink the disk usage down to something
more reasonable over time, but on the facts as presented, it can't
really be an urgent issue for them. What they really need is just free
up a little disk space somehow or other and then get archiving running
fast enough to keep up with future WAL generation. Regrettably, the
archiver cannot do this, not even if you set archive_command =
/bin/true, because the archiver will barely ever actually run the
archive_command. Instead, it will spend virtually all of its time
calling readdir(), because for some reason it feels a need to make a
complete scan of the archive_status directory before archiving a WAL
file, and then it has to make another scan before archiving the next
one.

Someone - and it's probably for the best that the identity of that
person remains unknown to me - came up with a clever solution to this
problem, which is now used almost as a matter of routine whenever this
comes up. You just run pg_archivecleanup on your pg_wal directory, and
then remove all the corresponding .ready files and call it a day. I
haven't scrutinized the code for pg_archivecleanup, but evidently it
avoids needing O(n^2) time for this and therefore can clean up the
whole directory in something like the amount of time the archiver
would take to deal with a single file. While this seems to be quite an
effective procedure and I have not yet heard any user complaints, it
seems disturbingly error-prone, and honestly shouldn't ever be
necessary. The issue here is only that pgarch.c acts as though after
archiving 000000010000000000000001, 000000010000000000000002, and then
000000010000000000000003, we have no idea what file we might need to
archive next. Could it, perhaps, be 000000010000000000000004? Only a
full directory scan will tell us the answer!

I have two possible ideas for addressing this; perhaps other people
will have further suggestions. A relatively non-invasive fix would be
to teach pgarch.c how to increment a WAL file name. After archiving
segment N, check using stat() whether there's an .ready file for
segment N+1. If so, do that one next. If not, then fall back to
performing a full directory scan. As far as I can see, this is just
cheap insurance. If archiving is keeping up, the extra stat() won't
matter much. If it's not, this will save more system calls than it
costs. Since during normal operation it shouldn't really be possible
for files to show up in pg_wal out of order, I don't really see a
scenario where this changes the behavior, either. If there are gaps in
the sequence at startup time, this will cope with it exactly the same
as we do now, except with a better chance of finishing before I
retire.

However, that's still pretty wasteful. Every time we have to wait for
the next file to be ready for archiving, we'll basically fall back to
repeatedly scanning the whole directory, waiting for it to show up.
And I think that we can't get around that by just using stat() to look
for the appearance of the file we expect to see, because it's possible
that we might be doing all of this on a standby which then gets
promoted, or some upstream primary gets promoted, and WAL files start
appearing on a different timeline, making our prediction of what the
next filename will be incorrect. But perhaps we could work around this
by allowing pgarch.c to access shared memory, in which case it could
examine the current timeline whenever it wants, and probably also
whatever LSNs it needs to know what's safe to archive. If we did that,
could we just get rid of the .ready and .done files altogether? Are
they just a really expensive IPC mechanism to avoid a shared memory
connection, or is there some more fundamental reason why we need them?
And is there any good reason why the archiver shouldn't be connected
to shared memory? It is certainly nice to avoid having more processes
connected to shared memory than necessary, but the current scheme is
so inefficient that I think we end up worse off.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com


Reply via email to