Hi,

On 12/17/2015 10:28 PM, Tomas Vondra wrote:
> Hi,
>
> On 12/17/2015 07:20 PM, Robert Haas wrote:
>> ...
>> If this doesn't regress performance in the case where the number of
>> buckets is estimated accurately to begin with, then I think this is a
>> great idea. Can you supply some performance test results for that
>> case, and maybe some of the other cases also?
>
> I don't see how it could regress performance, and the benchmarks I've
> done confirm that. I'll do more thorough benchmarking and post the
> results here, but not now as this patch is in the 2016-01 CF and I want
> to put all my time into reviewing patches from the open commitfest.
I've finally got around to doing more thorough benchmarks, and surprisingly there seems to be a minor regression. The scripts I've used are attached, along with the results, but in short the test joins two tables, in different size combinations:

    1M x 10M
    10M x 100M
    5M x 250M

with different work_mem sizes (so that the smallest value uses a number of batches while the largest value runs with a single batch).

The 1x10 and 10x100 data sets are designed to fit into RAM on the machine (including the temporary files in batching mode), while 5x250 is designed to specifically force the temp files to disk (but it's on a fast SSD array, so not a big deal).
Average timings for current master and the first two patches of the series (0001 postpones the build of buckets and 0002 always starts without batching) look like this (the timings are in milliseconds):
   size    work_mem     master   postpone   no-batch
  ----------------------------------------------------
   1x10          8        5735       5760       6107
                32        5939       5955       6124
               128        4852       5038       5175
  ----------------------------------------------------
   5x250        64      158512     158429     159008
               256      144046     144584     144994
              1024      111478     117529     116933
  ----------------------------------------------------
   10x100       64       68389      68278      68530
               256       66693      66415      66605
              1024       48606      50654      50490

So compared to master, the differences look like this:

   size    work_mem    postpone   no-batch
  -------------------------------------------
   1x10          8       0.44%      6.50%
                32       0.28%      3.13%
               128       3.83%      6.65%
  -------------------------------------------
   5x250        64      -0.05%      0.31%
               256       0.37%      0.66%
              1024       5.43%      4.89%
  -------------------------------------------
   10x100       64      -0.16%      0.21%
               256      -0.42%     -0.13%
              1024       4.21%      3.88%

So for the first patch (postponing the build of buckets) with batching, the difference is rather negligible, possibly within noise (~0.5%). When the hash join runs with a single batch, however, the difference gets much more significant: 4-5%.
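For clarity, the percentages are just the difference relative to master, i.e. (patched - master) / master, so positive numbers mean the patched version is slower. For example, for the 1x10 data set with work_mem=128, the postpone column is (5038 - 4852) / 4852 = 3.83%.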
I'm not going to discuss the second patch for now, but naturally it has the same issue and apparently also some additional ones (at least on the 1x10 data set).
I have spent a fair amount of time profiling this, but I can't wrap my head around where the difference comes from. Essentially, we don't do any additional work; we've just moved some of the existing work to a different place in the code.
It might change the memory access pattern a bit, but the original pattern is not exactly sequential either, as what we're moving around are random updates to an array of pointers. Or rather, we're moving them into the BuildBuckets() method, so if something consumes more time, it should show up in that method, I believe.
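To make the "moved to a different place" part concrete, here's a minimal standalone sketch of the idea (not the actual patch - the Tuple/HashTable structs and the build_buckets() name are made up for illustration): tuples are first accumulated densely with no bucket maintenance at all, and only once the final tuple count is known do we size the bucket array and link the tuples into it in a single pass, which is roughly the shape of the BuildBuckets() step discussed above:

/* toy illustration of postponing the bucket build (not the actual patch) */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct Tuple
{
    uint32_t      hashvalue;   /* precomputed hash of the join key */
    struct Tuple *next;        /* bucket chain link, filled in later */
    int           payload;
} Tuple;

typedef struct HashTable
{
    Tuple  **buckets;          /* array of bucket heads, built late */
    int      nbuckets;         /* power of two */
    Tuple   *tuples;           /* dense array of all inserted tuples */
    int      ntuples;
} HashTable;

static void
build_buckets(HashTable *ht)
{
    int i;

    /* size the bucket array from the actual tuple count, not the estimate */
    ht->nbuckets = 1;
    while (ht->nbuckets < ht->ntuples)
        ht->nbuckets <<= 1;

    ht->buckets = calloc(ht->nbuckets, sizeof(Tuple *));

    /* single pass: all the random writes to the pointer array happen here */
    for (i = 0; i < ht->ntuples; i++)
    {
        Tuple *tup = &ht->tuples[i];
        int    b = tup->hashvalue & (ht->nbuckets - 1);

        tup->next = ht->buckets[b];
        ht->buckets[b] = tup;
    }
}

int
main(void)
{
    HashTable ht = {0};
    int       n = 1000000;
    int       i;

    /* phase 1: just accumulate tuples, no bucket maintenance at all */
    ht.tuples = malloc(n * sizeof(Tuple));
    for (i = 0; i < n; i++)
    {
        ht.tuples[i].hashvalue = (uint32_t) i * 2654435761u;   /* cheap hash */
        ht.tuples[i].payload = i;
    }
    ht.ntuples = n;

    /* phase 2: postponed bucket build, once the tuple count is known */
    build_buckets(&ht);

    printf("built %d buckets for %d tuples\n", ht.nbuckets, ht.ntuples);

    free(ht.buckets);
    free(ht.tuples);
    return 0;
}

The point of the postponement is that nbuckets can be derived from the real tuple count rather than the planner estimate, at the cost of concentrating all the random pointer updates into that single pass.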
So I've measured how much time the function takes for the 1x10 and 10x100 datasets, and it's ~25ms and ~290ms respectively. That's way less than the actual difference in query runtime, which is ~190ms and ~2050ms.
So even if we entirely skipped the bucket build, we would not recover the difference. Not sure what's going on here.
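For the record, measuring the per-call time of that function doesn't need anything fancy; a sketch along these lines (reusing the toy HashTable and build_buckets() from the previous snippet, and simply wrapping the call in clock_gettime() - not how the numbers above were actually collected) is enough to ballpark it:

/* Sketch only: timing a single bucket-build pass with clock_gettime().
 * Reuses the toy HashTable and build_buckets() from the previous snippet. */
#include <stdio.h>
#include <time.h>

static double
elapsed_ms(struct timespec start, struct timespec end)
{
    return (end.tv_sec - start.tv_sec) * 1000.0 +
           (end.tv_nsec - start.tv_nsec) / 1e6;
}

static void
time_bucket_build(HashTable *ht)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    build_buckets(ht);                  /* the postponed bucket-build pass */
    clock_gettime(CLOCK_MONOTONIC, &end);

    printf("bucket build: %.1f ms for %d tuples\n",
           elapsed_ms(start, end), ht->ntuples);
}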
I've also done some profiling using perf, but I haven't really found anything suspicious there.
Any ideas what might be the cause of this?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment: hash-scripts.tgz (application/compressed-tar)
Attachment: hashes-delayed.ods (application/vnd.oasis.opendocument.spreadsheet)