On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost <sfr...@snowman.net> wrote:
> > Oh. Well, I already explained my algorithm for doing that upthread,
> > which I believe would be quite cheap.
> >
> > 1. When you generate the .modblock files, stick all the block
> > references into a buffer. qsort(). Dedup. Write out in sorted
> > order.
>
> Having all of the block references in a sorted order does seem like it
> would help, but would also make those potentially quite a bit larger
> than necessary (I had some thoughts about making them smaller elsewhere
> in this discussion). That might be worth it though. I suppose it might
> also be possible to line up the bitmaps suggested elsewhere to do
> essentially a BitmapOr of them to identify the blocks changed (while
> effectively de-duping at the same time).
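[Illustrative aside, not from any actual patch: the qsort-and-dedup step
described in point 1 might look roughly like the sketch below, assuming a
block reference boils down to a relfilenode/fork/block triple. ModBlockRef,
modblockref_cmp and sort_and_dedup are all invented names.]

#include <stdlib.h>

/* Hypothetical in-memory block reference: relation, fork, block. */
typedef struct ModBlockRef
{
    unsigned int relfilenode;
    int          forknum;
    unsigned int blocknum;
} ModBlockRef;

/* Order by relfilenode, then fork, then block number. */
static int
modblockref_cmp(const void *a, const void *b)
{
    const ModBlockRef *x = (const ModBlockRef *) a;
    const ModBlockRef *y = (const ModBlockRef *) b;

    if (x->relfilenode != y->relfilenode)
        return x->relfilenode < y->relfilenode ? -1 : 1;
    if (x->forknum != y->forknum)
        return x->forknum < y->forknum ? -1 : 1;
    if (x->blocknum != y->blocknum)
        return x->blocknum < y->blocknum ? -1 : 1;
    return 0;
}

/* qsort() the buffered references, then squeeze out duplicates in place. */
static size_t
sort_and_dedup(ModBlockRef *refs, size_t n)
{
    size_t  keep = 0;

    if (n == 0)
        return 0;
    qsort(refs, n, sizeof(ModBlockRef), modblockref_cmp);
    for (size_t i = 1; i < n; i++)
    {
        if (modblockref_cmp(&refs[keep], &refs[i]) != 0)
            refs[++keep] = refs[i];
    }
    return keep + 1;    /* number of distinct references */
}

Writing the surviving entries out in this order is what makes the later
merge pass cheap.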
I don't see why this would make them bigger than necessary. If you sort
by relfilenode/fork/blocknumber and dedup, then references to nearby
blocks will be adjacent in the file. You can then decide what format
will represent that most efficiently on output. Whether or not a bitmap
is a better idea than a list of block numbers or something else depends
on what percentage of blocks are modified and how clustered they are.

> > 2. When you want to use a bunch of .modblock files, do the same thing
> > MergeAppend does, or what merge-sort does when it does a merge pass.
> > Read the first 1MB of each file (or whatever amount). Repeatedly pull
> > an item from whichever file has the lowest remaining value, using a
> > binary heap. When no buffered data remains for a particular file,
> > read another chunk from that file.
>
> Sure, this is essentially a MergeAppend/MergeSort+GroupAgg on top to get
> the distinct set, if I'm following what you're suggesting here.

Yeah, something like that.

> > If each .modblock file covers 1GB of WAL, you could merge the data
> > from across 1TB of WAL using only 1GB of memory, and that's assuming
> > you have a 1MB buffer for each .modblock file. You probably don't
> > need such a large buffer. If you use, say, a 128kB buffer, you could
> > merge the data from across 8TB of WAL using 1GB of memory. And if you
> > have 8TB of WAL and you can't spare 1GB for the task of computing
> > which blocks need to be included in your incremental backup, it's
> > time for a hardware upgrade.
>
> How much additional work is it going to be to sort/dedup for each 1GB of
> WAL though, along with the resulting size?

Well, all you have to do is quicksort an array with a million or so
elements. I don't know off-hand how many CPU cycles that takes, but I
doubt it's a whole lot. And for the amount of CPU time and memory that
it saves you when you actually go to use the files, I think it's got to
be worth it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
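[Illustrative aside, continuing the hypothetical ModBlockRef and
modblockref_cmp from the sketch above: the merge pass from point 2 might
look roughly like this. A linear scan over the sources stands in for the
binary heap to keep the sketch short; the chunked buffering and per-file
refill are the part being illustrated. With one .modblock file per 1GB of
WAL, merging 1TB of WAL means 1024 sources, and at a 1MB buffer apiece
that is the roughly 1GB of working memory mentioned above.]

#include <stdio.h>
#include <stdbool.h>

/* One buffered, already-sorted .modblock file taking part in the merge. */
typedef struct ModBlockSource
{
    FILE        *fp;        /* the .modblock file */
    ModBlockRef *buf;       /* buffer holding e.g. 1MB of references */
    size_t       nbuf;      /* references currently buffered */
    size_t       next;      /* index of next unconsumed reference */
} ModBlockSource;

/* Refill a source's buffer from its file; nbuf becomes 0 at EOF. */
static void
source_refill(ModBlockSource *src, size_t bufcap)
{
    src->nbuf = fread(src->buf, sizeof(ModBlockRef), bufcap, src->fp);
    src->next = 0;
}

/*
 * Merge any number of sorted, deduplicated .modblock files and hand each
 * distinct block reference to emit() exactly once, in sorted order. The
 * linear scan for the lowest current reference is where a binary heap
 * would go; everything else (chunked buffering, refilling a source when
 * its buffer runs dry) is as described in the quoted text above.
 */
static void
merge_modblock_files(ModBlockSource *sources, int nsources,
                     size_t bufcap, void (*emit) (const ModBlockRef *))
{
    ModBlockRef prev;
    bool        have_prev = false;

    for (int i = 0; i < nsources; i++)
        source_refill(&sources[i], bufcap);

    for (;;)
    {
        int     best = -1;

        /* Pick the source whose next reference sorts lowest. */
        for (int i = 0; i < nsources; i++)
        {
            if (sources[i].next >= sources[i].nbuf)
                continue;   /* nothing buffered: file is exhausted */
            if (best < 0 ||
                modblockref_cmp(&sources[i].buf[sources[i].next],
                                &sources[best].buf[sources[best].next]) < 0)
                best = i;
        }
        if (best < 0)
            break;          /* every source is exhausted */

        /* Consume it, skipping duplicates seen from other files. */
        ModBlockRef ref = sources[best].buf[sources[best].next++];

        if (!have_prev || modblockref_cmp(&prev, &ref) != 0)
        {
            emit(&ref);
            prev = ref;
            have_prev = true;
        }

        /* When a source's buffer runs dry, read another chunk from it. */
        if (sources[best].next >= sources[best].nbuf)
            source_refill(&sources[best], bufcap);
    }
}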