Over at https://www.postgresql.org/message-id/CA%2BTgmobFVe4J4AA7z9OMUzKnm09Tt%2BsybhxeL_Ddst3q3wqpzQ%40mail.gmail.com I mentioned parsing the WAL to extract block references so that incremental backup could efficiently determine which blocks needed to be copied. Ashwin replied in http://postgr.es/m/CALfoeitO-vkfjubMFQRmgyXghL-uUnZLNxbr=obrqqsm8kf...@mail.gmail.com to mention that the same approach could be useful for pg_rewind:
> Currently, pg_rewind requires all the WAL logs to be present on source
> side from point of divergence to rewind. Instead just parse the WAL and
> keep the changed blocks around on source. Then don't need to retain the
> WAL but can still rewind using the changed block map.

Since there are at least two possible use cases for an efficient way to learn when blocks last changed, and since in each case the performance benefits of being able to learn that are potentially quite large, I'm starting this thread to brainstorm how such a system might work.

It seems to me that there are basically two ways of storing this kind of information, plus a bunch of variants. One way is to store files that each cover a range of LSNs and basically contain a synopsis of the WAL for those LSNs: you omit all the actual data and just record which blocks were changed by some record in that part of the WAL. In this type of scheme, the storage required is roughly proportional to the volume of WAL for which you wish to retain data. Pruning old data is easy; just remove the files that provide information about LSNs you no longer care about. The other way is to store data about each block, or each range of blocks, or all the blocks that hash onto a certain slot; for each, store the newest LSN that has modified that block, a block in that range, or a block that hashes onto that slot. In this system, storage is roughly proportional to the size of the database cluster, except maybe in the hashing case, but I *think* that case will degrade unless you basically expand the map to be roughly proportional to the size of the cluster anyway. I may be wrong.

Of these two variants, I am inclined to prefer the version where each file is a summary of the block references within some range of LSNs. It seems simpler to implement to me: you just read a bunch of WAL files, and when you get tired you stop and emit an output file. You do need to protect yourself against untimely crashes. One way is to stick a checksum into the output file; after you finish writing it, fsync() it before you start writing the next one, and after a restart, read the latest such file and check whether the checksum is OK. If not, regenerate it; if so, assume it's good and move on. Files other than the last one can be assumed good. Another way is to create the file with a temporary name, fsync() it, and then rename it into place and fsync() again. The background worker that generates the files can have a GUC to remove them when they are older than some threshold amount of time, or you can keep them forever and let the user manually remove the ones they no longer want based on LSN. That's pretty much it.
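To make that second crash-protection recipe concrete, here is a minimal standalone sketch of the write-under-a-temporary-name-then-rename-into-place pattern in plain POSIX C. The file name, the ".blockrefs" suffix, and the payload are made up for illustration; a real implementation would live in the backend and go through its existing file-access machinery rather than raw system calls.

/*
 * Sketch: write a WAL-summary ("block reference") file under a temporary
 * name, fsync() it, rename() it into place, and fsync() the directory so
 * the rename itself survives a crash.  All names here are invented for
 * illustration.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int
write_summary_file(const char *dir, const char *finalname,
                   const void *data, size_t len)
{
    char        tmppath[1024];
    char        finalpath[1024];
    int         fd;
    int         dirfd;

    snprintf(tmppath, sizeof(tmppath), "%s/%s.tmp", dir, finalname);
    snprintf(finalpath, sizeof(finalpath), "%s/%s", dir, finalname);

    /* Write the whole file under a temporary name, then fsync() it. */
    fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    /* A real implementation would loop here to handle short writes. */
    if (write(fd, data, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        unlink(tmppath);
        return -1;
    }
    close(fd);

    /* Atomically move it into place... */
    if (rename(tmppath, finalpath) != 0)
    {
        unlink(tmppath);
        return -1;
    }

    /* ...and fsync() the directory so the rename is durable, too. */
    dirfd = open(dir, O_RDONLY);
    if (dirfd < 0)
        return -1;
    if (fsync(dirfd) != 0)
    {
        close(dirfd);
        return -1;
    }
    close(dirfd);
    return 0;
}

int
main(void)
{
    const char *payload = "pretend this is a serialized list of block references";

    /* The WAL-segment-style name and ".blockrefs" suffix are made up. */
    if (write_summary_file(".", "000000010000000000000010.blockrefs",
                           payload, strlen(payload)) != 0)
    {
        perror("write_summary_file");
        return 1;
    }
    return 0;
}

The final fsync() on the directory is what makes the rename itself durable; without it, a crash at the wrong moment could leave the new directory entry unrecorded even though the file's contents are safely on disk.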
The version where you keep an LSN per block/range/hash bucket seems more complex in terms of durability. Now you have to worry not only about creating new files but also about modifying them, which adds some complexity. I think it may be possible to make it work without doing write-ahead logging for every change, but it certainly needs careful thought to avoid torn-page problems and checkpoint synchronization issues. Moreover, it potentially uses lots and lots of inodes if there are many relations in the cluster. You can avoid that by not creating maps for small files, if you like, or by switching to the hash bucket approach, but either of those choices is lossy. Omitting the maps for small files means you always have to assume everything in those files changed, and the hash bucket approach is vulnerable to hash collisions: you have to assume that all blocks that hash to a given bucket have changed. Those are probably manageable disadvantages, but I think they are real ones.

There is one thing that does worry me about the file-per-LSN-range approach, and that is memory consumption when trying to consume the information. Suppose you have a really high-velocity system. I don't know exactly what the busiest systems around are doing in terms of data churn these days, but let's say just for kicks that we are dirtying 100GB/hour. That's roughly 12.5 million block references per hour. If each block reference takes 12 bytes, that's maybe 150MB/hour in block reference files. If you run a daily incremental backup, you've got to load all the block references for the last 24 hours and deduplicate them, which means you're going to need about 3.6GB of memory. If you run a weekly incremental backup, you're going to need about 25GB of memory. That is not ideal.

One can keep the memory consumption to a more reasonable level by using temporary files. For instance, say you realize you're going to need 25GB of memory to store all the block references you have, but you're only allowed to use 1GB. Well, just hash-partition the data 32 ways by dboid/tsoid/relfilenode/segno, writing each batch to a separate temporary file, and then process each of those 32 files separately. That does add some additional I/O, but it's not crazily complicated and doesn't seem too terrible, at least to me. Still, it's something not to like.
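Here is a minimal standalone sketch of that partitioning step in C. The BlockRef layout, the FNV-style hash, the partition count, and the file names are all assumptions invented for the example, and the input references are synthetic rather than pulled from real WAL; it's only meant to show the shape of the technique.

/*
 * Sketch: hash-partition block references 32 ways by the identity of the
 * relation segment they belong to, spilling each partition to its own
 * temporary file so it can later be sorted and deduplicated within a
 * small memory budget.
 */
#include <stdint.h>
#include <stdio.h>

#define NUM_PARTITIONS 32

typedef struct BlockRef
{
    uint32_t    dboid;          /* database OID */
    uint32_t    tsoid;          /* tablespace OID */
    uint32_t    relfilenode;    /* relation file node */
    uint32_t    segno;          /* 1GB segment number */
    uint32_t    blkno;          /* block number within the segment */
} BlockRef;

/* FNV-1a-style mix over the fields identifying the relation segment. */
static uint32_t
blockref_partition(const BlockRef *ref)
{
    uint32_t    h = 2166136261u;

    h = (h ^ ref->dboid) * 16777619u;
    h = (h ^ ref->tsoid) * 16777619u;
    h = (h ^ ref->relfilenode) * 16777619u;
    h = (h ^ ref->segno) * 16777619u;
    return h % NUM_PARTITIONS;
}

int
main(void)
{
    FILE       *part[NUM_PARTITIONS];
    char        name[64];
    int         i;

    /* One temporary spill file per partition. */
    for (i = 0; i < NUM_PARTITIONS; i++)
    {
        snprintf(name, sizeof(name), "blockrefs.part%02d.tmp", i);
        part[i] = fopen(name, "wb");
        if (part[i] == NULL)
        {
            perror("fopen");
            return 1;
        }
    }

    /* Stand-in for block references extracted from a range of WAL. */
    for (i = 0; i < 1000; i++)
    {
        BlockRef    ref = {1, 1663, 16384 + (uint32_t) (i % 10), 0, (uint32_t) i};

        fwrite(&ref, sizeof(ref), 1, part[blockref_partition(&ref)]);
    }

    /*
     * Each spill file now holds only the references for the relations that
     * hash to it, so it can be loaded, sorted, and deduplicated on its own.
     */
    for (i = 0; i < NUM_PARTITIONS; i++)
        fclose(part[i]);
    return 0;
}

Because the partition is chosen from the relation identity rather than from the block number, all references to the same relation segment land in the same spill file, so each partition can be deduplicated on its own without ever looking at the others.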
Another problem to think about is whether the most recent data is going to be available when you need it. This concern applies to either approach. In the case of incremental backup, the server is going to be up and running, so if the WAL scanner gets behind, you can just wait for it to catch up. In the case of pg_rewind, the server is going to be down, so that doesn't work. You probably need to figure out how new the data you have is, and then scan all the newer WAL to pick up any additional block references. That's a bit of a pain, but I don't see any real alternative. In the case of the file-per-LSN-range approach, it's easy to see which LSNs are covered by the files. In the case of the LSN-per-block/range/hash-bucket approach, you can presumably rely on the last checkpoint having flushed all the pending changes to disk, but the WAL scanner could've been arbitrarily far behind at that point, so you'd have to store a piece of metadata someplace that tells you exactly how far the WAL scanner had progressed and then have pg_rewind fish it out.

Yet another question is how to make sure WAL doesn't get removed before we finish scanning it. Peter mentioned on the other thread that we could use a variant of a replication slot, which immediately made me wonder why we'd need a variant. Actually, the biggest problem I see here is that if we use a replication slot, somebody might try to drop it or use it for some other purpose, and that would mess things up. I guess we could pull the usual trick of reserving names that start with 'pg_' for internal purposes. Or we could just hard-code the LSN that was last scanned for this purpose as a bespoke constraint on WAL discard. Not sure what is best.

I think all of this should be optional functionality. It's going to be really useful for people with large databases, I think, but people with small databases may not care, and it won't be entirely free. If it's not enabled, then the functionality that would otherwise exploit it can fall back to doing things in a less efficient way; nothing needs to break hard.

Opinions?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company