On Thu, Sep 15, 2016 at 9:51 PM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
>> I still don't get why you're doing all of this within mergeruns() (the
>> beginning of when we start merging -- we merge all quicksorted runs),
>> rather than within beginmerge() (the beginning of one particular merge
>> pass, of which there are potentially more than one). As runs are
>> merged in a non-final merge pass, fewer tapes will remain active for
>> the next merge pass. It doesn't do to do all that up-front when we
>> have multiple merge passes, which will happen from time to time.
>
> Now that the pre-reading is done in logtape.c, it doesn't stop at a run
> boundary. For example, when we read the last 1 MB of the first run on a
> tape, and we're using a 10 MB read buffer, we will merrily also read the
> first 9 MB from the next run. You cannot un-read that data, even if the
> tape is inactive in the next merge pass.
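To restate the quoted scenario in code, here is a toy model (mine alone -- not logtape.c's actual code) of a tape whose runs are stored back to back, where a refill always tries to fill the whole buffer, and so inevitably reads past a run boundary:

#include <stdint.h>
#include <string.h>

/* Toy model -- not logtape.c: runs are stored back to back on a tape */
typedef struct ToyTape
{
    const char *data;           /* tape contents, runs back to back */
    int64_t     len;            /* total bytes on tape */
    int64_t     pos;            /* next byte to read */
} ToyTape;

/* Fill buf with up to bufsize bytes; returns the number buffered */
static int64_t
toy_refill(ToyTape *tape, char *buf, int64_t bufsize)
{
    int64_t     nbytes = tape->len - tape->pos;

    if (nbytes > bufsize)
        nbytes = bufsize;
    memcpy(buf, tape->data + tape->pos, (size_t) nbytes);

    /*
     * With 1 MB left in the current run and a 10 MB buffer, pos now
     * points 9 MB into the *next* run on this tape. Nothing here can
     * un-read that data for a later merge pass.
     */
    tape->pos += nbytes;
    return nbytes;
}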
I've had a chance to think about this some more. I did a flying review of the same revision discussed here, but have since realized a few more things. First, a recap.

> I don't think it makes much difference in practice, because most merge
> passes use all, or almost all, of the available tapes. BTW, I think the
> polyphase algorithm prefers to do all the merges that don't use all tapes
> upfront, so that the last final merge always uses all the tapes. I'm not
> 100% sure about that, but that's my understanding of the algorithm, and
> that's what I've seen in my testing.

I'm not sure that I understand. I agree that each merge pass tends to use roughly the same number of tapes, but the distribution of real runs on tapes is quite unbalanced in earlier merge passes (due to dummy runs). It looks like you're always using batch memory, even for non-final merges. Won't that be out of balance much of the time, precisely because of the lopsided distribution of runs? Tapes have an uneven amount of real data in earlier merge passes.

FWIW, polyphase merge doesn't actually distribute runs based on the Fibonacci sequence (at least not in Postgres). It uses a generalization of the Fibonacci numbers [1] -- *which* generalization varies according to merge order/maxTapes. IIRC, with Knuth's P == 6 (i.e. merge order == 6), it's the "hexanacci" sequence.

The following code is from your latest version, posted on 2016-09-14, within the high-level mergeruns() function (the function that takes quicksorted runs, and produces final output following one or more merge passes):

> +     usedBlocks = 0;
> +     for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
> +     {
> +         int64       numBlocks = blocksPerTape + (tapenum < remainder ? 1 : 0);
> +
> +         if (numBlocks > MaxAllocSize / BLCKSZ)
> +             numBlocks = MaxAllocSize / BLCKSZ;
> +         LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
> +                                         numBlocks * BLCKSZ);
> +         usedBlocks += numBlocks;
> +     }
> +     USEMEM(state, usedBlocks * BLCKSZ);

I'm basically repeating myself here, but: I think it's incorrect that LogicalTapeAssignReadBufferSize() is called so indiscriminately (more generally, it is questionable that it is called from such a high-level routine, rather than at the start of a specific merge pass -- I've said so a couple of times already). It's also weird that you change the meaning of maxTapes by reassigning it in advance of iterating up to maxTapes like this.

I should point out again that the master branch mergebatch() function does roughly the same thing as you're doing here, as follows:

static void
mergebatch(Tuplesortstate *state, int64 spacePerTape)
{
    int         srcTape;

    Assert(state->activeTapes > 0);
    Assert(state->tuples);

    /*
     * For the purposes of tuplesort's memory accounting, the batch
     * allocation is special, and regular memory accounting through
     * USEMEM() calls is abandoned (see mergeprereadone()).
     */
    for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
    {
        char       *mergetuples;

        if (!state->mergeactive[srcTape])
            continue;

        /* Allocate buffer for each active tape */
        mergetuples = MemoryContextAllocHuge(state->tuplecontext,
                                             spacePerTape);

        /* Initialize state for tape */
        state->mergetuples[srcTape] = mergetuples;
        state->mergecurrent[srcTape] = mergetuples;
        state->mergetail[srcTape] = mergetuples;
        state->mergeoverflow[srcTape] = NULL;
    }

    state->batchUsed = true;
    state->spacePerTape = spacePerTape;
}

Notably, this function:

* Works without altering the meaning of maxTapes. state->maxTapes is Knuth's T, which is established very early and doesn't change with polyphase merge (the same applies to state->tapeRange). What does change across merge passes is the number of *active* tapes. I don't think it's necessary to change maxTapes in any way, and I find it very confusing that you do. (Also, that you're using state->currentRun here at all seems bad, for more or less the same reason -- that's the number of quicksorted runs.)

* Does allocation for the final merge (that's the only point at which it's called), and so is not based on the number of active tapes that happen to be in play when merging begins at a high level (i.e., when mergeruns() is first called). Many tapes will be totally inactive by the final merge, so this seems completely necessary for multiple merge pass cases.

End of recap. Here is some new information: I was previously confused on this last point, because I thought that logtape.c might be able to do something smart to recycle the memory that is bound per-tape by calls to LogicalTapeAssignReadBufferSize() in your patch. But it doesn't: all the recycling stuff only happens for the much smaller buffers that are juggled and reused to pass tuples back to the caller within tuplesort_gettuple_common(), etc. -- not the logtape.c managed buffers.

So, AFAICT there is no justification for not adopting these notable properties of mergebatch() at some analogous point in this patch. Actually, you should probably not get rid of mergebatch(), but instead call LogicalTapeAssignReadBufferSize() there. This change would mean that the big per-tape memory allocations would happen in the same place as before -- you'd just be asking logtape.c to do it for you, instead of allocating directly.
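Concretely, I'm imagining something of this shape (a hypothetical sketch only, not code from either the patch or master -- it borrows master's state->mergeactive[] bookkeeping and your LogicalTapeAssignReadBufferSize() interface, and assumes availBlocks is the read buffer budget, in blocks):

/*
 * Hypothetical sketch, not from either patch: size read buffers per
 * *active* tape at the start of each merge pass, rather than dividing
 * among all maxTapes once in mergeruns().
 */
static void
assignReadBuffers(Tuplesortstate *state, int64 availBlocks)
{
    int         activeTapes = 0;
    int64       blocksPerTape;
    int64       remainder;
    int         tapenum;

    /* Count the tapes that are active in the merge pass about to begin */
    for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
    {
        if (state->mergeactive[tapenum])
            activeTapes++;
    }
    Assert(activeTapes > 0);

    blocksPerTape = availBlocks / activeTapes;
    remainder = availBlocks % activeTapes;

    /* Divide the budget among active tapes only; inactive tapes get 0 */
    for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
    {
        int64       numBlocks;

        if (!state->mergeactive[tapenum])
            continue;

        numBlocks = blocksPerTape + (remainder-- > 0 ? 1 : 0);
        if (numBlocks > MaxAllocSize / BLCKSZ)
            numBlocks = MaxAllocSize / BLCKSZ;
        LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
                                        numBlocks * BLCKSZ);
    }
}

The point is just that the sizing happens per merge pass, against the tapes that are actually active, in the same spot where mergebatch() does its allocations today.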
Perhaps the practical consequences of not doing something closer to mergebatch() are debatable, but I suspect that there is little point in actually debating it. I think you might as well do it that way. No?

[1] http://mathworld.wolfram.com/Fibonaccin-StepNumber.html

--
Peter Geoghegan