On Wed, Feb 8, 2012 at 1:01 AM, Hitoshi Harada <umi.tan...@gmail.com> wrote:
> On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>>
>> The attached patch allows it to reuse that memory.  On my meager
>> system it reduced the building of an index on an integer column in a
>> skinny 200 million row totally randomly ordered table by about 3% from
>> a baseline of 25 minutes.
>>
>
> Just to give a standard review, this patch is one line change and
> applies cleanly, builds ok.
>
> I'm not pretty sure what exactly you're trying to accomplish, but it
> seems to me that it's avoiding the first dumptuples cycle by forcing
> availMem = 0 even if it's negative.
Yes.  Currently, when it switches to the TSS_BUILDRUNS part of a
tape-sort, it starts by calling WRITETUP a large number of times
consecutively, to work off the memory deficit incurred by the 3 blocks
per tape of tape overhead, and only after that calls WRITETUP about
once per puttuple.  Under my patch, it would call WRITETUP about once
per puttuple right from the beginning.

> I read your comments as it'd be
> avoiding to alternate reading/writing back and force with scattered
> memory failing memory cache much during merge phase, but actually it
> doesn't affect merge phase but only init-dump phase, does it?

It affects the building of the runs.  But this building of the runs is
not a simple dump; it is itself a mini merge phase, in which it merges
the existing in-memory priority queue against the still-incoming tuples
from the node which invoked the sort.  Because it uses less memory than
it could, the resulting runs are smaller than they could be, and so
will sometimes necessitate an additional layer of merging later on.

(This effect is particularly large for the very first run being built.
Generally, by merging incoming tuples into the memory-tuples, you can
create runs that are 1.7 times the size of what fits in memory.  By
wasting some memory, we are getting 1.7 times the size of a smaller
starting point.  But for the first run, it is worse than that.  Most of
the benefit that leads to that 1.7 multiplier comes at the very early
stage of each run-build.  By initially using the full memory, then
writing out a bunch of tuples without doing any merge of the incoming
tuples, we have truncated the part that gives the most benefit.)

My analysis that the "freed" memory is never reused (because we refuse
to reuse it ourselves and it is too fragmented to be reused by anyone
else, like the palloc or VM system) only applies to the run-building
phase, so "never" was a bit of an overstatement.  By the time the last
initial run is completely written out to tape, the heap used for the
priority queue should be totally empty.  At that point the allocator
would have the chance to congeal all of the fragmented memory back into
larger chunks, or maybe it parcels the allocations back out again in an
order such that the unused space is contiguous and could be
meaningfully paged out.  But is it worth worrying about how much we
fragment memory and whether we overshoot our promises by 10 or 20%?

> If so,
> I'm not so convinced your benchmark gave 3 % gain by this change.
> Correct me as I'm probably wrong.

I've now done more complete testing: building an index on a
200,000,000-row table with an integer column populated in random order
with integers from 1..500,000,000, non-unique, on a machine with 2GB of
RAM and 600MB of shared_buffers.

It improves things by 6-7 percent at the low end of the
maintenance_work_mem sizes tested; the rest are in the noise, except at
77936 KB, where it reproducibly makes things 4% worse, for reasons I
haven't figured out.  So maybe the best thing to do is, rather than
micromanaging memory usage, simply not to set maintenance_work_mem way
too low.  (But that low setting is the default.)

maintenance_work_mem (KB)    %Change with patch
                    16384     6.2
                    19484     7.8
                    23170    -0.1
                    27554     0.1
                    32768     1.9
                    38968    -1.9
                    46341     0.4
                    55109     0.4
                    65536     0.6
                    77936    -4.3
                    92682    -0.3
                   110218     0.1
                   131072     1.7

> Anyway, it's nice to modify the comment just above the change, since
> it's now true with it.

Much of that comment is referring to other things (some of which I
don't understand).
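For anyone reading along without tuplesort.c in front of them, here is
a tiny standalone simulation of the bookkeeping being discussed.  It is
only a sketch, not the patch itself: the identifiers availMem, USEMEM,
LACKMEM, and TAPE_BUFFER_OVERHEAD are borrowed from tuplesort.c, but
the constants, the use of plain functions instead of the real macros,
and the exact placement of the clamp are my own assumptions, made
purely for illustration.

    /*
     * Standalone sketch (not the actual patch): simulates the availMem
     * bookkeeping around the switch to TSS_BUILDRUNS.  The identifiers
     * mirror tuplesort.c, but everything here is illustrative only.
     */
    #include <stdio.h>
    #include <stdbool.h>

    #define BLCKSZ                  8192
    #define TAPE_BUFFER_OVERHEAD    (BLCKSZ * 3)    /* ~3 blocks per tape */

    typedef struct
    {
        long        availMem;   /* bytes of work memory still unspent */
    } SimSortState;

    /* In tuplesort.c these are macros; functions are fine for a demo. */
    static void
    USEMEM(SimSortState *state, long amt)
    {
        state->availMem -= amt;
    }

    static bool
    LACKMEM(const SimSortState *state)
    {
        return state->availMem < 0;
    }

    int
    main(void)
    {
        SimSortState state = {.availMem = 16L * 1024};  /* tiny work mem */
        int         maxTapes = 7;

        /* Switching to tape sort charges buffer space for every tape... */
        USEMEM(&state, (long) maxTapes * TAPE_BUFFER_OVERHEAD);

        /*
         * ...which can drive availMem well below zero.  Today,
         * dumptuples() responds by calling WRITETUP until LACKMEM()
         * clears, shrinking the first run.  The idea under discussion
         * amounts to forgiving that deficit instead, so the memory
         * already allocated (and fragmented) keeps being reused:
         */
        printf("deficit before clamp: %ld (LACKMEM = %d)\n",
               state.availMem, (int) LACKMEM(&state));

        if (state.availMem < 0)
            state.availMem = 0;     /* the one-line idea */

        printf("after clamp: %ld (LACKMEM = %d)\n",
               state.availMem, (int) LACKMEM(&state));

        return 0;
    }

Compiled with any C99 compiler, this just shows how charging the
per-tape buffer overhead drives the accounting negative, and how
forgiving the deficit would let the first run keep using the memory it
has already allocated.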
How about an additional comment just before my code, to the gist of:

/*
 * If we have already used, and thus fragmented, the memory then we
 * might as well continue to make use of it as no one else can.
 */

(That might not actually be true, if the tuples we are sorting are very
large, or perhaps if they arrive in already reverse-sorted order.)

Thanks,

Jeff

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers