On Tue, Nov 7, 2023 at 5:53 PM Peter Geoghegan <p...@bowt.ie> wrote:
> If you end up finding a bug in this v6, it'll most likely be a case
> where nbtree fails to live up to that. This project is as much about
> robust/predictable performance as anything else -- nbtree needs to be
> able to cope with practically anything. I suggest that your review
> start by trying to break the patch along these lines.
I spent some time on this myself today (which I'd already planned on). Attached is an adversarial stress-test, which shows something that must be approaching the worst case for the patch in terms of time spent with a buffer lock held, due to spending so much time evaluating unusually expensive SAOP index quals. The array binary searches that take place with a buffer lock held aren't quite like anything else that nbtree can do right now, so they're worthy of special attention.

I thought of several factors that maximize both the number of binary searches within any given _bt_readpage call and the cost of each binary search -- the SQL file has full details.

My test query is *extremely* unrealistic, since it combines multiple independent unrealistic factors, all of which aim to make life hard for the implementation in one way or another. I hesitate to say that it couldn't be much worse (I only spent a few hours on this), but I'm prepared to say that it seems very unlikely that any real-world query could make the patch spend as many cycles in _bt_readpage/_bt_checkkeys as this one does. Perhaps you can think of some other factor that would make this test case even less sympathetic towards the patch, Matthias?

The only thing I thought of that I've left out was the use of a custom btree opclass, "unrealistically_slow_ops" -- something that calls pg_usleep in its order proc. (I left it out because it wouldn't prove anything.)

On my machine, custom instrumentation shows that each call to _bt_readpage made while this query executes (on a patched server) takes just under 1.4 milliseconds. While that is far longer than it usually takes, it's basically acceptable IMV. It's not significantly longer than I'd expect heap_index_delete_tuples() to take on an average day with EBS (or other network-attached storage).
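To give a rough idea of the shape of the test, here's an illustrative sketch (only a sketch -- the attached nemesis.sql itself differs in its details, e.g. it uses all 32 columns and far larger arrays):

```sql
-- Illustrative sketch only (not the attached nemesis.sql): a wide
-- composite index, queried with one large SAOP clause per index
-- column, so that _bt_readpage must perform an array binary search
-- for every column of every tuple it examines on the page.
CREATE TABLE stress (c1 int, c2 int, c3 int /* ... through c32 */);
CREATE INDEX stress_idx ON stress (c1, c2, c3 /* ... through c32 */);

SELECT count(*)
FROM stress
WHERE c1 = ANY ('{1,2,3}'::int[])  -- in practice: many array elements
  AND c2 = ANY ('{1,2,3}'::int[])
  AND c3 = ANY ('{1,2,3}'::int[]);
  -- ... one SAOP clause per remaining index column
```

The real test stacks further unsympathetic factors on top of this basic shape, as described above and in the SQL file itself.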
But heap_index_delete_tuples() is a process that happens all the time, with an exclusive buffer lock held on the leaf page throughout -- whereas this is only a shared buffer lock, and involves a query that's just absurd.

Another factor that makes this seem acceptable is just how sensitive the test case is to everything going exactly and perfectly wrong, all at the same time, again and again. The test case uses a 32-column index (the INDEX_MAX_KEYS maximum), with a query that has 32 SAOP clauses (one per index column). If I reduce the number of SAOP clauses in the query to (say) 8, I still have a test case that's almost as silly as my original -- but now we only spend ~225 microseconds in each _bt_readpage call (i.e. we spend over 6x less time in each _bt_readpage call). (Admittedly, if I also make the CREATE INDEX use only 8 columns, we can fit more index tuples on one page, leaving us at ~800 microseconds.)

I'm a little surprised that it isn't a lot worse than this, given how far I went. I was a little concerned that it would prove necessary to lock this kind of thing down at some higher level (e.g., in the planner), but that now looks unnecessary.

There are much better ways to DOS the server than this. For example, you could run this same query while forcing a sequential scan! That appears to be quite a lot less responsive to interrupts (in addition to being hopelessly slow), probably because it uses parallel workers, each of which will use wildly expensive filter quals that just do a linear scan of the SAOP.

--
Peter Geoghegan
nemesis.sql
Description: Binary data