On Tue, Nov 7, 2023 at 5:53 PM Peter Geoghegan <p...@bowt.ie> wrote:
> If you end up finding a bug in this v6, it'll most likely be a case
> where nbtree fails to live up to that. This project is as much about
> robust/predictable performance as anything else -- nbtree needs to be
> able to cope with practically anything. I suggest that your review
> start by trying to break the patch along these lines.
I spent some time on this myself today (which I'd already planned on). Attached is an adversarial stress-test, which shows something that must be approaching the worst case for the patch in terms of time spent with a buffer lock held, due to spending so much time evaluating unusually expensive SAOP index quals. The array binary searches that take place with a buffer lock held aren't quite like anything else that nbtree can do right now, so they're worthy of special attention.

I thought of several factors that maximize both the number of binary searches within any given _bt_readpage call and the cost of each binary search -- the SQL file has full details.

My test query is *extremely* unrealistic, since it combines multiple independent unrealistic factors, all of which aim to make life hard for the implementation in one way or another. I hesitate to say that it couldn't be much worse (I only spent a few hours on this), but I'm prepared to say that it seems very unlikely that any real-world query could make the patch spend as many cycles in _bt_readpage/_bt_checkkeys as this one does. Perhaps you can think of some other factor that would make this test case even less sympathetic towards the patch, Matthias?

The only thing I thought of that I've left out was the use of a custom btree opclass, "unrealistically_slow_ops" -- something that calls pg_usleep in its order proc. (I left it out because it wouldn't prove anything.)

On my machine, custom instrumentation shows that each call to _bt_readpage made while this query executes (on a patched server) takes just under 1.4 milliseconds. While that is far longer than it usually takes, it's basically acceptable IMV. It's not significantly longer than I'd expect heap_index_delete_tuples() to take on an average day with EBS (or other network-attached storage).
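To give a rough idea of the shape of the test, here's an illustrative sketch (only a sketch -- the attached nemesis.sql itself differs in its details, e.g. it uses all 32 columns and far larger arrays):

```sql
-- Illustrative sketch only (not the attached nemesis.sql): a wide
-- composite index, queried with one large SAOP clause per index
-- column, so that _bt_readpage must perform an array binary search
-- for every column of every tuple it examines on the page.
CREATE TABLE stress (c1 int, c2 int, c3 int /* ... through c32 */);
CREATE INDEX stress_idx ON stress (c1, c2, c3 /* ... through c32 */);

SELECT count(*)
FROM stress
WHERE c1 = ANY ('{1,2,3}'::int[])  -- in practice: many array elements
  AND c2 = ANY ('{1,2,3}'::int[])
  AND c3 = ANY ('{1,2,3}'::int[]);
  -- ... one SAOP clause per remaining index column
```

The real test stacks further unsympathetic factors on top of this basic shape, as described above and in the SQL file itself.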
But heap_index_delete_tuples() is a process that happens all the time, with an exclusive buffer lock held on the leaf page throughout -- whereas this is only a shared buffer lock, and involves a query that's just absurd.

Another factor that makes this seem acceptable is just how sensitive the test case is to everything going exactly and perfectly wrong, all at the same time, again and again. The test case uses a 32-column index (the INDEX_MAX_KEYS maximum), with a query that has 32 SAOP clauses (one per index column). If I reduce the number of SAOP clauses in the query to (say) 8, I still have a test case that's almost as silly as my original -- but now we only spend ~225 microseconds in each _bt_readpage call (i.e. we spend over 6x less time in each _bt_readpage call). (Admittedly, if I also make the CREATE INDEX use only 8 columns, we can fit more index tuples on one page, leaving us at ~800 microseconds.)

I'm a little surprised that it isn't a lot worse than this, given how far I went. I was a little concerned that it would prove necessary to lock this kind of thing down at some higher level (e.g., in the planner), but that now looks unnecessary.

There are much better ways to DOS the server than this. For example, you could run this same query while forcing a sequential scan! That appears to be quite a lot less responsive to interrupts (in addition to being hopelessly slow), probably because it uses parallel workers, each of which will use wildly expensive filter quals that just do a linear scan of the SAOP.

--
Peter Geoghegan
nemesis.sql
Description: Binary data