On 2/13/21 10:39 PM, Stephen Frost wrote:
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:
Yeah, that's a good point. I think it'd make sense to keep track of recent
FPIs and skip prefetching such blocks. But how exactly should we implement
that, how many blocks do we need to track? If you get an FPI, how long
should we skip prefetching of that block?

I don't think the history needs to be very long, for two reasons. Firstly,
the usual pattern is that we have FPI + several changes for that block
shortly after it. Secondly, maintenance_io_concurrency limits this naturally
- after crossing that, redo should place the FPI into shared buffers,
allowing us to skip the prefetch.

So I think using maintenance_io_concurrency is sufficient. We might track
more buffers to allow skipping prefetches of blocks that were evicted from
shared buffers, but that seems like an overkill.

However, maintenance_io_concurrency can be quite high, so just a simple
queue is not very suitable - searching it linearly for each block would be
too expensive. But I think we can use a simple hash table, tracking
(relfilenode, block, LSN), over-sized to minimize collisions.

Imagine it's a simple array with (2 * maintenance_io_concurrency) elements,
and whenever we prefetch a block or find an FPI, we simply add the block to
the array as determined by hash(relfilenode, block)

     hashtable[hash(...)] = {relfilenode, block, LSN}

and then when deciding whether to prefetch a block, we look at that one
position. If the (relfilenode, block) match, we check the LSN and skip the
prefetch if it's sufficiently recent. Otherwise we prefetch.

I'm a bit doubtful this is really needed at this point. Yes, the
prefetching will do a buffer table lookup - but it's a lookup that
already happens today. And the patch already avoids doing a second
lookup after prefetching (by optimistically caching the last Buffer id,
and re-checking).

I agree that when a page is looked up, and found, in the buffer table
that the subsequent cacheing of the buffer id in the WAL records does a
good job of avoiding having to re-do that lookup.  However, that isn't
the case which was being discussed here or what Tomas's suggestion was
intended to address.

What I pointed out up-thread and what's being discussed here is what
happens when the WAL contains a few FPIs and a few regular WAL records
which are mixed up and not in ideal order.  When that happens, with this
patch, the FPIs will be ignored, the regular WAL records will reference
blocks which aren't found in shared buffers (yet) and then we'll both
issue pre-fetches for those and end up having spent effort doing a
buffer lookup that we'll later re-do.


The question is how common this pattern actually is - I don't know. As noted, the non-FPI would have to be fairly close to the FPI, i.e. within the wal_decode_buffer_size, to actually cause measurable harm.

To address the unnecessary syscalls we really just need to keep track of
any FPIs that we've seen between where the point where the prefetching
is happening and the point where the replay is being done- once replay
has replayed an FPI, our buffer lookup will succeed and we'll cache the
buffer that the FPI is at- in other words, only wal_decode_buffer_size
amount of WAL needs to be considered.


Yeah, that's essentially what I proposed.

We could further leverage this tracking of FPIs, to skip the prefetch
syscalls, by cacheing what later records address the blocks that have
FPIs earlier in the queue with the FPI record and then when replay hits
the FPI and loads it into shared_buffers, it could update the other WAL
records in the queue with the buffer id of the page, allowing us to very
likely avoid having to do another lookup later on.


This seems like an over-engineering, at least for v1.

I think there's potential for some significant optimization going
forward, but I think it's basically optimization over what we're doing
today. As this is already a nontrivial patch, I'd argue for doing so
separately.

This seems like a great optimization, albeit a fair bit of code, for a
relatively uncommon use-case, specifically where full page writes are
disabled or very large checkpoints.  As that's the case though, I would
think it's reasonable to ask that it go out of its way to avoid slowing
down the more common configurations, particularly since it's proposed to
have it on by default (which I agree with, provided it ends up improving
the common cases, which I think the suggestions above would certainly
make it more likely to do).


I'm OK to do some benchmarking, but it's not quite clear to me why does it matter if the checkpoints are smaller than shared buffers? IMO what matters is how "localized" the updates are, i.e. how likely it is to hit the same page repeatedly (in a short amount of time). Regular pgbench is not very suitable for that, but some non-uniform distribution should do the trick, I think.

Perhaps this already improves the common cases and is worth the extra
code on that basis, but I don't recall seeing much in the way of
benchmarking in this thread for that case- that is, where FPIs are
enabled and checkpoints are smaller than shared buffers.  Jakub's
testing was done with FPWs disabled and Tomas's testing used checkpoints
which were much larger than the size of shared buffers on the system
doing the replay.  While it's certainly good that this patch improves
those cases, we should also be looking out for the worst case and make
sure that the patch doesn't degrade performance in that case.


I'm with Andres on this. It's fine to leave some possible optimizations on the table for the future. And even if some workloads are affected negatively, it's still possible to disable the prefetching.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply via email to