On Tue, Mar 12, 2019 at 11:40 AM Robert Haas <robertmh...@gmail.com> wrote:
> I think it's pretty clear that we have to view that as acceptable. I
> mean, we could reduce contention even further by finding a way to make
> indexes 40% larger, but I think it's clear that nobody wants that.
> Now, maybe in the future we'll want to work on other techniques for
> reducing contention, but I don't think we should make that the problem
> of this patch, especially because the regressions are small and go
> away after a few hours of heavy use. We should optimize for the case
> where the user intends to keep the database around for years, not
> hours.
I came back to the question of contention recently. I don't think it's
okay to make contention worse in workloads where indexes are mostly the
same size as before, such as almost any workload that pgbench can
simulate. I have made a lot of the fact that the TPC-C indexes are ~40%
smaller, in part because lots of people outside the community find TPC-C
interesting, and in part because this patch series is focused on cases
where we currently do unusually badly (cases where good intuitions about
how B-Trees are supposed to perform break down [1]). These pinpointable
problems must affect a lot of users some of the time, but certainly not
all users all of the time.

The patch series is actually supposed to *improve* the situation with
index buffer lock contention in general, and it looks like it manages to
do that with pgbench, which doesn't do inserts into indexes, except for
those required for non-HOT updates. pgbench requires relatively few page
splits, but is in every other sense a high contention workload.
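To make the "indexes stay about the same size" part of that concrete:
with so few page splits, the pgbench indexes should come out at
practically identical sizes on both branches. Anyone who wants to verify
that for themselves after a run can use something along these lines
(shown purely as an illustration -- those numbers aren't part of the
results below):

-- Compare pgbench index sizes after a run on each branch; with so few
-- page splits in this workload, the sizes should hardly differ at all.
SELECT pg_size_pretty(pg_relation_size('pgbench_accounts_pkey')) AS accounts_pkey,
       pg_size_pretty(pg_relation_size('pgbench_tellers_pkey'))  AS tellers_pkey,
       pg_size_pretty(pg_relation_size('pgbench_branches_pkey')) AS branches_pkey;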
With pgbench scale factor 20, here are results for patch and master with
a Gaussian distribution, run on my 8 thread/4 core home server. Each
reported run lasted 10 minutes, repeating twice for client counts 1, 2,
8, 16, and 64, on both the patch and the master branch:

\set aid random_gaussian(1, 100000 * :scale, 20)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

1st pass
========

(init pgbench from scratch for each database, scale 20)

1 client master:
tps = 7203.983289 (including connections establishing)
tps = 7204.020457 (excluding connections establishing)
latency average = 0.139 ms
latency stddev = 0.026 ms

1 client patch:
tps = 7012.575167 (including connections establishing)
tps = 7012.590007 (excluding connections establishing)
latency average = 0.143 ms
latency stddev = 0.020 ms

2 clients master:
tps = 13434.043832 (including connections establishing)
tps = 13434.076194 (excluding connections establishing)
latency average = 0.149 ms
latency stddev = 0.032 ms

2 clients patch:
tps = 13105.620223 (including connections establishing)
tps = 13105.654109 (excluding connections establishing)
latency average = 0.153 ms
latency stddev = 0.033 ms

8 clients master:
tps = 27126.852038 (including connections establishing)
tps = 27126.986978 (excluding connections establishing)
latency average = 0.295 ms
latency stddev = 0.095 ms

8 clients patch:
tps = 27945.457965 (including connections establishing)
tps = 27945.565242 (excluding connections establishing)
latency average = 0.286 ms
latency stddev = 0.089 ms

16 clients master:
tps = 32297.612323 (including connections establishing)
tps = 32297.743929 (excluding connections establishing)
latency average = 0.495 ms
latency stddev = 0.185 ms

16 clients patch:
tps = 33434.889405 (including connections establishing)
tps = 33435.021738 (excluding connections establishing)
latency average = 0.478 ms
latency stddev = 0.167 ms

64 clients master:
tps = 25699.029787 (including connections establishing)
tps = 25699.217022 (excluding connections establishing)
latency average = 2.482 ms
latency stddev = 1.715 ms

64 clients patch:
tps = 26513.816673 (including connections establishing)
tps = 26514.013638 (excluding connections establishing)
latency average = 2.405 ms
latency stddev = 1.690 ms

2nd pass
========

(init pgbench from scratch for each database, scale 20)

1 client master:
tps = 7172.995796 (including connections establishing)
tps = 7173.013472 (excluding connections establishing)
latency average = 0.139 ms
latency stddev = 0.022 ms

1 client patch:
tps = 7024.724365 (including connections establishing)
tps = 7024.739237 (excluding connections establishing)
latency average = 0.142 ms
latency stddev = 0.021 ms

2 clients master:
tps = 13489.016303 (including connections establishing)
tps = 13489.047968 (excluding connections establishing)
latency average = 0.148 ms
latency stddev = 0.032 ms

2 clients patch:
tps = 13210.292833 (including connections establishing)
tps = 13210.321528 (excluding connections establishing)
latency average = 0.151 ms
latency stddev = 0.029 ms

8 clients master:
tps = 27470.112858 (including connections establishing)
tps = 27470.229891 (excluding connections establishing)
latency average = 0.291 ms
latency stddev = 0.093 ms

8 clients patch:
tps = 28132.981815 (including connections establishing)
tps = 28133.096414 (excluding connections establishing)
latency average = 0.284 ms
latency stddev = 0.081 ms

16 clients master:
tps = 32409.399669 (including connections establishing)
tps = 32409.533400 (excluding connections establishing)
latency average = 0.493 ms
latency stddev = 0.182 ms

16 clients patch:
tps = 33678.304986 (including connections establishing)
tps = 33678.427420 (excluding connections establishing)
latency average = 0.475 ms
latency stddev = 0.168 ms

64 clients master:
tps = 25864.453485 (including connections establishing)
tps = 25864.639098 (excluding connections establishing)
latency average = 2.466 ms
latency stddev = 1.698 ms

64 clients patch:
tps = 26382.926218 (including connections establishing)
tps = 26383.166692 (excluding connections establishing)
latency average = 2.417 ms
latency stddev = 1.678 ms

There was a third run which has been omitted, because it's practically
the same as the first two. The order that results appear in is the order
things actually ran in (I like to interlace master and patch runs
closely).

Analysis
========

There seems to be a ~2% regression with one or two clients, but we more
than make up for that as the client count goes up -- the 8 and 64 client
cases improve throughput by ~2.5%, and the 16 client case improves
throughput by ~4%. This seems like a totally reasonable trade-off to me.
As I said already, the patch isn't really about workloads that we
already do acceptably well on, such as this one, so you're not expected
to be impressed with these numbers. My goal is to show that boring
workloads that fit everything in shared_buffers appear to be fine. I
think that that's a reasonable conclusion, based on these numbers. Lower
client count cases are generally considered less interesting, and also
lose less in throughput than we go on to gain as more clients are added.
I'd be surprised if anybody complained.

I think that the explanation for the regression with one or two clients
boils down to this: we're making better decisions about where to split
pages, and even about how pages are accessed by index scans (more on
that in the next paragraph). However, this isn't completely free
(particularly the page split stuff), and it doesn't pay for itself until
the number of clients ramps up.

Even so, not being more careful about that stuff would be penny wise,
pound foolish. I even suspect that there are priority inversion issues
when there is high contention during unique index enforcement, which
might be a big problem on multi-socket machines with hundreds of
clients. I am not in a position to confirm that right now, but we have
heard reports that are consistent with this explanation at least once
before [2]. Zipfian was also somewhat better when I last measured it,
using the same fairly modest machine -- I didn't repeat that here
because I wanted something simple and widely studied.

The patch establishes the principle that there is only one good reason
to visit more than one leaf page within index scans like those used by
pgbench: a concurrent page split, where the scan simply must go right to
find matches that were just missed in the first leaf page. That should
be very rare. We should never visit two leaf pages because we're
confused about where there might be matches. There is simply no good
reason for there to be any ambiguity or confusion.

The patch could still make index scans like these visit more than a
single leaf page for a bad reason, at least in theory: when there are at
least ~400 duplicates in a unique index, and we therefore can't possibly
store them all on one leaf page, index scans will of course have to
visit more than one leaf page. Again, that should be very rare.

All index scans can now check the high key on the leaf level, and avoid
going right when they happen to be very close to the right edge of the
leaf page's key space. And, we never have to take the scenic route when
descending the tree on an equal internal page key, since that condition
has practically been eliminated by suffix truncation. No new tuple can
be equal to negative infinity, and negative infinity appears in every
pivot tuple. There is a place for everything, and everything is in its
place.
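Anyone who wants to see what suffix truncation leaves behind in pivot
tuples can look at a leaf page's high key directly with
contrib/pageinspect. Something along these lines works (block 1 is just
an arbitrary example block -- use bt_page_stats() to confirm that it's a
leaf page; and with a single-column index like pgbench_accounts_pkey
there is naturally much less to truncate than in the multi-column TPC-C
indexes):

-- Requires the pageinspect extension from contrib
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Confirm that the chosen block is a leaf page (type 'l')
SELECT blkno, type, live_items, avg_item_size
FROM bt_page_stats('pgbench_accounts_pkey', 1);

-- On any non-rightmost page, itemoffset = 1 is the high key; its "data"
-- only contains whatever suffix truncation decided had to be kept
SELECT itemoffset, ctid, itemlen, data
FROM bt_page_items('pgbench_accounts_pkey', 1)
ORDER BY itemoffset
LIMIT 5;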
[1] https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c/
[2] https://postgr.es/m/bf3b6f54-68c3-417a-bfab-fb4d66f2b...@postgrespro.ru

--
Peter Geoghegan