On Mon, Feb 11, 2019 at 12:54 PM Peter Geoghegan <p...@bowt.ie> wrote:
> Notable improvements in v12:
I've been benchmarking v12, once again using a slightly modified BenchmarkSQL that doesn't do up-front CREATE INDEX builds [1], since the problems with index bloat don't take so long to manifest themselves when the indexes are inserted into incrementally from the very beginning. This benchmarking process took over 20 hours, with a database that started off at about 90GB (700 TPC-C/BenchmarkSQL warehouses were used). That easily exceeded available main memory on my test server, which was 32GB. This is a pretty I/O-bound workload, and a fairly write-heavy one at that. I used a Samsung 970 PRO 512GB NVMe PCIe M.2 2280 SSD for both pg_wal and the default (and only) tablespace.

Importantly, I figured out that I should disable both hash joins and merge joins with BenchmarkSQL, in order to force all joins to be nested loop joins (see the sketch after the run summary below). Otherwise, the "stock level" transaction eventually starts to use a hash join, even though that's about 10x slower than a nestloop join (~4ms vs. ~40ms on this machine) -- the hash join produces a lot of noise without really testing anything. It usually takes a couple of hours before we start to get obviously-bad plans, but it also usually takes about that long for the patch series to noticeably overtake the master branch. I don't think that TPC-C will ever benefit from using a hash join or a merge join, since it's supposed to be a pure OLTP benchmark, and is a benchmark that MySQL is known to do at least respectably well on.

This is the first benchmark I've published that is considerably I/O bound. There are significant improvements in performance across the board, on every measure, though it takes several hours for that to really show. The benchmark was not rate-limited. 16 clients/"terminals" are used throughout. There were 5 runs for master and 5 for the patch, interlaced, each lasting 2 hours. Initialization occurred once, so it's expected that both databases will gradually get larger across runs.

Summary (appears in same order as the execution of each run) -- each run is 2 hours, so 20 hours total excluding initial load time (2 hours * 5 runs for master + 2 hours * 5 runs for patch):

Run 1 -- master: Measured tpmTOTAL = 90063.79, Measured tpmC (NewOrders) = 39172.37
Run 1 -- patch: Measured tpmTOTAL = 90922.63, Measured tpmC (NewOrders) = 39530.2
Run 2 -- master: Measured tpmTOTAL = 77091.63, Measured tpmC (NewOrders) = 33530.66
Run 2 -- patch: Measured tpmTOTAL = 83905.48, Measured tpmC (NewOrders) = 36508.38 <-- 8.8% increase in tpmTOTAL/throughput
Run 3 -- master: Measured tpmTOTAL = 71224.25, Measured tpmC (NewOrders) = 30949.24
Run 3 -- patch: Measured tpmTOTAL = 78268.29, Measured tpmC (NewOrders) = 34021.98 <-- 9.8% increase in tpmTOTAL/throughput
Run 4 -- master: Measured tpmTOTAL = 71671.96, Measured tpmC (NewOrders) = 31163.29
Run 4 -- patch: Measured tpmTOTAL = 73097.42, Measured tpmC (NewOrders) = 31793.99
Run 5 -- master: Measured tpmTOTAL = 66503.38, Measured tpmC (NewOrders) = 28908.8
Run 5 -- patch: Measured tpmTOTAL = 71072.3, Measured tpmC (NewOrders) = 30885.56 <-- 6.9% increase in tpmTOTAL/throughput

There were *also* significant reductions in transaction latency for the patch -- see the full html reports in the provided tar archive for full details (URL provided below). The html reports have nice SVG graphs, generated by BenchmarkSQL using R -- one for transaction throughput, and another for transaction latency.
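To be concrete about disabling those join types: for a benchmark like this it's enough to flip the usual planner GUCs on the benchmark database. This is just a sketch (the database name "benchmarksql" is only an example -- use whatever the BenchmarkSQL database is actually called):

    -- force the planner to use nested loop joins for the benchmark database
    -- (database name is hypothetical/illustrative)
    ALTER DATABASE benchmarksql SET enable_hashjoin = off;
    ALTER DATABASE benchmarksql SET enable_mergejoin = off;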
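Similarly, in case it helps anyone reproduce the pg_stat figures discussed below: I reset the statistics counters between runs and dump the relevant views at the end of each run. Something along these lines works (a sketch only, not necessarily verbatim what my scripts do):

    -- before each run, connected to the benchmark database:
    SELECT pg_stat_reset();                  -- per-database counters (pg_stat_database, etc.)
    SELECT pg_stat_reset_shared('bgwriter'); -- cluster-wide checkpoint/bgwriter counters

    -- after each run:
    SELECT blks_read, blks_hit FROM pg_stat_database
    WHERE datname = current_database();
    SELECT buffers_checkpoint, buffers_backend FROM pg_stat_bgwriter;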
The overall picture is that the patched version starts out ahead, and has a much more gradual decline as the database becomes larger and more bloated.

Note also that the statistics collector stats show a *big* reduction in blocks read into shared_buffers for the duration of these runs. For example, here is what pg_stat_database shows for run 3 (I reset the stats between runs):

master: blks_read = 78,412,640, blks_hit = 4,022,619,556
patch: blks_read = 70,033,583, blks_hit = 4,505,308,517 <-- 10.7% reduction in blks_read (blocks read into shared_buffers)

This suggests an indirect benefit, likely related to how buffers are evicted in each case. pg_stat_bgwriter indicates that more buffers are written out during checkpoints, while fewer are written out by backends. I won't speculate further on what all of this means right now, though.

You can find the raw details for blks_read for each and every run in the full tar archive. It is available for download from:

https://drive.google.com/file/d/1kN4fDmh1a9jtOj8URPrnGYAmuMPmcZax/view?usp=sharing

There are also dumps of the other pg_stat* views at the end of each run, logs for each run, etc. There's probably more information in there than anybody is likely to find interesting.

If anyone needs help in recreating this benchmark, then I'd be happy to assist with that. There is a shell script (zsh) included in the tar archive, although that will need to be changed a bit to point to the correct installations and so on. Independent validation of the performance of the patch series on this and other benchmarks is very welcome.

[1] https://github.com/petergeoghegan/benchmarksql/tree/nbtree-customizations

--
Peter Geoghegan