On Sun, Feb 7, 2016 at 4:50 PM, Peter Geoghegan <p...@heroku.com> wrote:
>> I'm not even sure this is necessary. The idea of missing out on
>> producing a single sorted run sounds bad, but in practice, since we
>> normally do the final merge on the fly, there doesn't really seem to
>> be any difference between reading one tape or reading two or three
>> tapes when outputting the final results. There will be the same
>> amount of I/O happening, and a 2-way or 3-way merge for most data
>> types should be basically free.
>
> I basically agree with you, but it seems possible to fix the
> regression (generally misguided though those regressed cases are).
> It's probably easiest to just fix it.
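To put a rough number on "basically free": a heap-based k-way merge does only
a handful of comparisons per output tuple (roughly proportional to log2(k)),
no matter how much data is read, so a 2-way or 3-way final merge adds a
comparison or two per tuple on top of I/O that has to happen anyway. Here's a
self-contained toy sketch that counts those comparisons -- ordinary C, made-up
names, nothing to do with tuplesort.c's actual merge code:

/*
 * Toy heap-based k-way merge of presorted integer runs, counting
 * comparisons.  Illustration only: the heap does a handful of
 * comparisons per output element, independent of total data volume.
 */
#include <stdio.h>

typedef struct
{
    int     value;              /* current head element of this run */
    int     run;                /* which run it came from */
    int     pos;                /* next position within that run */
} HeapEntry;

static long ncompares = 0;

static int
heap_less(HeapEntry *a, HeapEntry *b)
{
    ncompares++;
    return a->value < b->value;
}

static void
sift_down(HeapEntry *heap, int n, int i)
{
    for (;;)
    {
        int     left = 2 * i + 1,
                right = 2 * i + 2,
                smallest = i;

        if (left < n && heap_less(&heap[left], &heap[smallest]))
            smallest = left;
        if (right < n && heap_less(&heap[right], &heap[smallest]))
            smallest = right;
        if (smallest == i)
            break;

        HeapEntry tmp = heap[i];

        heap[i] = heap[smallest];
        heap[smallest] = tmp;
        i = smallest;
    }
}

int
main(void)
{
    const int   k = 3;          /* number of runs ("tapes") */
    const int   runlen = 1000000;
    static int  runs[3][1000000];
    HeapEntry   heap[3];
    long        output = 0;
    int         nheap = k;

    /* build k presorted runs */
    for (int r = 0; r < k; r++)
        for (int i = 0; i < runlen; i++)
            runs[r][i] = i * k + r;

    /* prime the heap with the first element of each run */
    for (int r = 0; r < k; r++)
        heap[r] = (HeapEntry) {runs[r][0], r, 1};
    for (int i = k / 2 - 1; i >= 0; i--)
        sift_down(heap, k, i);

    /* merge: repeatedly pop the minimum and pull the next element */
    while (nheap > 0)
    {
        HeapEntry  *top = &heap[0];

        output++;               /* "write" top->value to the output */
        if (top->pos < runlen)
            top->value = runs[top->run][top->pos++];
        else
            heap[0] = heap[--nheap];
        sift_down(heap, nheap, 0);
    }

    printf("merged %ld tuples from %d runs, %.2f comparisons/tuple\n",
           output, k, (double) ncompares / output);
    return 0;
}

For k = 3 that reports about 2 comparisons per tuple, which is noise next to
the cost of producing the runs in the first place.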
Here is a benchmark on my laptop:

$ pgbench -i -s 500 --unlogged

This results in a ~1GB accounts PK:

postgres=# \di+ pgbench_accounts_pkey
List of relations
─[ RECORD 1 ]──────────────────────
Schema      │ public
Name        │ pgbench_accounts_pkey
Type        │ index
Owner       │ pg
Table       │ pgbench_accounts
Size        │ 1071 MB
Description │

The query I'm testing is: "reindex index pgbench_accounts_pkey;"

Now, with a maintenance_work_mem of 5MB, the most recent revision of my
patch takes about 54.2 seconds to complete this, as compared to master's
44.4 seconds. So, clearly a noticeable regression there, of just over 20%.
I did not see a regression with a 5MB maintenance_work_mem when the
pgbench scale was 100, though. And, with the default maintenance_work_mem
of 64MB, it's a totally different story -- my patch takes about 28.3
seconds, whereas master takes 48.5 seconds (i.e. longer than master takes
with only 5MB).

My patch needs a 56-way final merge in the 64MB maintenance_work_mem case,
and 47 distinct merge steps plus a final on-the-fly merge in the 5MB
maintenance_work_mem case. So, a huge amount of merging, and yet
replacement selection still hardly pays for itself.

With the regressed case for my patch, we finish sorting *runs* about 15
seconds into a 54.2 second operation -- very early. So it isn't "quicksort
vs. replacement selection" so much as "polyphase merge vs. replacement
selection". There is good reason to think that we can make progress on
fixing that regression by doubling down on the general strategy of
improving cache characteristics, and by being cleverer about memory use
during non-final merging, too.

I looked at what it would take to make the heap a smaller part of
memtuples, along the lines Robert and I talked about, and I think it's
non-trivial, because it needs to make the top of the heap something other
than memtuples[0] (see the sketch further down for what that does to the
heap arithmetic). I'd need to change the heap code, which already has 3
reasons for existing (replacement selection, merging, and the top-N heap).
I'll find it really hard to justify the effort, and especially the risk of
adding bugs, for a benefit that there is *scant* evidence for. My guess is
that the easiest, and most sensible, way to fix the ~20% regression seen
here is to introduce batch memory allocation to non-final merge steps,
which is where most of the time was spent. (For simplicity, that currently
only happens during the final merge phase, but I could revisit that -- it
seems not that hard.)

Now, I accept that the cost model has to go. So, what I think would be
best is if we still added a GUC, like the replacement_sort_mem suggestion
that Robert made. This would be a threshold for using what is currently
called "quicksort with spillover". There'd be no cost model. Jeff Janes
also suggested something like this.

The only regression that I find concerning is the one reported by Jeff
Janes [1]. That didn't even involve a correlation, though, so there is no
reason to think that it would be at all helped by what Robert and I talked
about; I think correlation is a red herring here. It seemed like the patch
happened to have the effect of tickling a pre-existing problem with
polyphase merge -- what Jeff called an "anti-sweetspot". Jeff had a
plausible theory for why that is. So, what if we try to fix polyphase
merge? That would be easier. We could look at the tape buffer size and the
number of tapes, as well as memory access patterns. We might even make
more fundamental changes to polyphase merge, since we don't use the more
advanced variant that Knuth suggests: his algorithm 5.4.3, cascade merge,
has a more efficient distribution of runs.
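Coming back to the memtuples[0] point for a moment, here is roughly why I say
it's invasive. This is only a toy illustration of the arithmetic involved
(made-up names, a plain max-heap over ints, not tuplesort.c code): once the
heap's root can live at an arbitrary offset inside the array, every
parent/child calculation in the sift routines has to become relative to that
base.

/*
 * Sift-down for a max-heap that occupies a[base .. base + n - 1] rather
 * than a[0 .. n - 1].  "i" is a slot relative to base.  The point is
 * just that every access now has to be offset by base -- which is why
 * moving the heap away from memtuples[0] touches all of the heap
 * maintenance code.  Names are made up for illustration.
 */
#include <stdio.h>

static void
offset_heap_sift_down(int *a, int base, int n, int i)
{
    while (2 * i + 1 < n)
    {
        int     child = 2 * i + 1;

        /* pick the larger of the two children, if both exist */
        if (child + 1 < n && a[base + child + 1] > a[base + child])
            child++;
        if (a[base + i] >= a[base + child])
            break;              /* heap property already holds here */

        int     tmp = a[base + i];

        a[base + i] = a[base + child];
        a[base + child] = tmp;
        i = child;
    }
}

int
main(void)
{
    /* pretend the first four slots are in use for something else */
    int     a[10] = {-1, -1, -1, -1, 3, 9, 2, 7, 5, 1};
    int     base = 4;
    int     n = 6;

    /* build a max-heap in place over a[base .. base + n - 1] */
    for (int i = n / 2 - 1; i >= 0; i--)
        offset_heap_sift_down(a, base, n, i);

    printf("heap root is a[%d] = %d\n", base, a[base]);     /* prints 9 */
    return 0;
}

Multiply that little offset by every heap routine, and by the three existing
users of the heap code, and the bug surface starts to look hard to justify for
the benefit on offer.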
The bottom line is that there will always be some regression somewhere.
I'm not sure what the guiding principle for when that becomes unacceptable
is, but you did seem sympathetic to the idea that really low work_mem
settings (e.g. 4MB) with really big inputs were not too concerning [2].
I'm emphasizing Jeff's case now because I, like you [2], am much more
worried about regressions at the default maintenance_work_mem than
anything else, and that *was* such a case. Like Jeff Janes, I don't care
about his other regression of about 5% [3], which involved 4MB of work_mem
and 100 million tuples. The other case (the one I do care about) was 64MB
and 400 million tuples, and was a much bigger regression, which is
suggestive of the unpredictable nature of the problems with polyphase
merge scheduling that Jeff talked about. Maybe we just got unlucky there,
but that risk should not blind us to the fact that, overwhelmingly,
replacement selection is the wrong thing.

I'm sorry that I've reversed myself like this, Robert, but I'm just not
seeing a benefit to what we talked about, while I do see a cost.

[1] http://www.postgresql.org/message-id/CAMkU=1zkbozkx-nqe-kjffmynm2hmgyl9askdeuhhwxassj...@mail.gmail.com
[2] http://www.postgresql.org/message-id/CA+TgmoZGFt6BAxW9fYOn82VAf1u=v0zzx3bxms79phjg_9n...@mail.gmail.com
[3] http://www.postgresql.org/message-id/CAM3SWZTYneCG1oZiPwRU=j6ks+vprxt2da1zmmqfbrd5jas...@mail.gmail.com

--
Peter Geoghegan