On Tue, Nov 27, 2018 at 5:43 AM Andrey Borodin <x4...@yandex-team.ru> wrote:
> > On 31 Aug 2018, at 2:40, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> > [1] https://arxiv.org/pdf/1509.05053.pdf
>
> I've implemented all of the strategies used in that paper.
> On a B-tree page we have line pointers ordered in key order and tuples
> residing at the end of the page. Tuples at the end are mostly "appended",
> i.e. they are usually accommodated at the end of the free space.
> The idea of the attached patch is to try ordering the tuples according to
> different strategies. This ordering happens during VACUUM and can easily
> be broken by a single insert.
> Strategies are switched by #define:
> 1. USE_SEQ_ORDER - just order tuples in the order of the line pointers.
> This strategy has no idea behind it and is expected to perform similarly
> to the baseline (no code changes at all).
> 2. USE_EYZINGER_ORDER - Eytzinger layout (depth-first topological order).
> If you always choose the lower branch, your search is just a sequential
> read of bytes.
> 3. USE_VEB_ORDER - van Emde Boas layout - try to pack three tuples (mid
> and two sub-mids) together. The key idea is that you take a cache miss
> when reading the mid, but the tuples for both possible next steps are
> already fetched by that same miss.
> 4. USE_BT_ORDER - place the two sub-mids together so that both possible
> next mids can be brought in by a single cache prefetch.
>
> Cache prefetches in the B-tree binary search are controlled by USE_PREFETCH.
>
> I've tested all these strategies on my laptop (2.2 GHz Intel i7, 16 GB
> RAM, macOS 10.14.1):
> ./pgbench -i -s 25 postgres && ./pgbench -T 100 postgres
> No changes to configs.
>
> Here are the results in tps:
>
> without prefetch
> baseline 1448.331199
> seq      1433.737204
> bt       1463.701527
> veb      1457.586089
> eyz      1483.654765
>
> with prefetch
> baseline 1486.585895
> seq      1471.757042
> bt       1480.169967
> veb      1464.834678
> eyz      1460.323392
>
> These numbers are not statistically cleaned - just one run was done for
> each number, and the experiment needs tuning anyway.
> From this I can conclude:
> 1. It is hard to observe a significant performance influence with
> pgbench. Maybe I should use a more representative B-tree performance test?
> 2. Cache prefetches seem to increase performance even without any layout
> strategy, but the difference is a hair-splitting 2.6%.
> 3. For the implemented strategies, my preliminary results correspond to
> the results in the paper: the Eytzinger layout is the best.
> 4. Among the layout strategies with prefetch, the bt-order strategy seems
> to be the winner.
>
> How can I improve the experimental evaluation of these layout strategies?
> Other thoughts? I'd be happy to hear any comments on this.
What about read-only tests with prewarmed indexes? Those numbers look IO
bound.

This is focusing on prefetch within the btree code, but I wonder if there
is some lower-hanging fruit that doesn't require any layout changes: heap
prefetch during index scans. By analogy to figure 8a, "software-pipelined
prefetching", of [1], I wonder if you could build a pipeline using dirty
(unlocked) memory reads, like this:

1. Prefetch the buffer mapping hash bucket for heap_tids[n + 3]
2. Prefetch the page header for heap_tids[n + 2] using a dirty read of the
first value in the buffer mapping hash bucket
3. Prefetch the heap tuple for heap_tids[n + 1] using a dirty read of the
heap page
4. Return the heap tid for heap_tids[n] to the caller

If you're lucky, by the time the caller uses the tid to find the buffer,
read the line pointer and access the tuple, all three things will be in
cache and hopefully won't have changed since the 'dirty' reads.
Prefetching the wrong cachelines would obviously be possible, if there is
a buffer mapping hash table collision and the entry you wanted isn't
first, or stuff just changed. Maybe there aren't enough cachelines for
this 3-level pipeline to work, considering that random other executor
nodes could run in between, so perhaps just a 2-level or 1-level pipeline
would pay off.

As I showed in my toy hash join prefetch experiment [2], I could get a
20-30% speedup from prefetching just the hash buckets (equivalent to
prefetching buffer mapping hash buckets), and a further 5-10% speedup from
prefetching the first item in the hash bucket, but I expect nested hash
joins to interfere with that as they compete for cache lines. I suppose I
could try three levels and fetch the next tuple in the chain, but that
seems unlikely to help because we're getting into lower probabilities that
I'd even want that data; in contrast, the index scan -> heap scan case
*definitely* needs to go through a line pointer to a tuple somewhere on
the page, so 3-level might be worth experimenting with.
Just a thought - I have not tried any of this. Maybe there are really 4
levels, if you count the buffer descriptor/lock.

[1] http://www.cs.cmu.edu/~chensm/papers/hashjoin_tods_preliminary.pdf
[2] https://www.postgresql.org/message-id/CA%2BhUKGKB%3DEWq%2Brj4NK8CX6ZqHZzuQYWSb9iDfYKzGN6-ejShUQ%40mail.gmail.com

--
Thomas Munro
https://enterprisedb.com