On Thu, Sep 15, 2016 at 9:51 PM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
>> I still don't get why you're doing all of this within mergeruns() (the
>> beginning of when we start merging -- we merge all quicksorted runs),
>> rather than within beginmerge() (the beginning of one particular merge
>> pass, of which there are potentially more than one). As runs are
>> merged in a non-final merge pass, fewer tapes will remain active for
>> the next merge pass. It doesn't do to do all that up-front when we
>> have multiple merge passes, which will happen from time to time.
>
> Now that the pre-reading is done in logtape.c, it doesn't stop at a run
> boundary. For example, when we read the last 1 MB of the first run on a
> tape, and we're using a 10 MB read buffer, we will merrily also read the
> first 9 MB from the next run. You cannot un-read that data, even if the
> tape is inactive in the next merge pass.
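To restate the quoted scenario in code, here is a toy model (mine alone -- not logtape.c's actual code) of a tape whose runs are stored back to back, where a refill always tries to fill the whole buffer, and so inevitably reads past a run boundary:

#include <stdint.h>
#include <string.h>

/* Toy model -- not logtape.c: runs are stored back to back on a tape */
typedef struct ToyTape
{
    const char *data;           /* tape contents, runs back to back */
    int64_t     len;            /* total bytes on tape */
    int64_t     pos;            /* next byte to read */
} ToyTape;

/* Fill buf with up to bufsize bytes; returns the number buffered */
static int64_t
toy_refill(ToyTape *tape, char *buf, int64_t bufsize)
{
    int64_t     nbytes = tape->len - tape->pos;

    if (nbytes > bufsize)
        nbytes = bufsize;
    memcpy(buf, tape->data + tape->pos, (size_t) nbytes);

    /*
     * With 1 MB left in the current run and a 10 MB buffer, pos now
     * points 9 MB into the *next* run on this tape. Nothing here can
     * un-read that data for a later merge pass.
     */
    tape->pos += nbytes;
    return nbytes;
}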
I've had a chance to think about this some more. I did a flying review of the same revision discussed here, but have since realized a few more things. First, a recap.

> I don't think it makes much difference in practice, because most merge
> passes use all, or almost all, of the available tapes. BTW, I think the
> polyphase algorithm prefers to do all the merges that don't use all tapes
> upfront, so that the last final merge always uses all the tapes. I'm not
> 100% sure about that, but that's my understanding of the algorithm, and
> that's what I've seen in my testing.

I'm not sure that I understand. I agree that each merge pass tends to use roughly the same number of tapes, but the distribution of real runs on tapes is quite unbalanced in earlier merge passes (due to dummy runs). It looks like you're always using batch memory, even for non-final merges. Won't that be out of balance much of the time, precisely because of the lopsided distribution of runs? Tapes have an uneven amount of real data in earlier merge passes.

FWIW, polyphase merge doesn't actually distribute runs based on the Fibonacci sequence (at least not in Postgres). It uses a generalization of the Fibonacci numbers [1] -- *which* generalization varies according to merge order/maxTapes. IIRC, with Knuth's P == 6 (i.e. merge order == 6), it's the "hexanacci" sequence.

The following code is from your latest version, posted on 2016-09-14, within the high-level mergeruns() function (the function that takes quicksorted runs, and produces final output following one or more merge passes):

> +     usedBlocks = 0;
> +     for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
> +     {
> +         int64       numBlocks = blocksPerTape + (tapenum < remainder ? 1 : 0);
> +
> +         if (numBlocks > MaxAllocSize / BLCKSZ)
> +             numBlocks = MaxAllocSize / BLCKSZ;
> +         LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
> +                                         numBlocks * BLCKSZ);
> +         usedBlocks += numBlocks;
> +     }
> +     USEMEM(state, usedBlocks * BLCKSZ);

I'm basically repeating myself here, but: I think it's incorrect that LogicalTapeAssignReadBufferSize() is called so indiscriminately (more generally, it is questionable that it is called from such a high-level routine, rather than at the start of a specific merge pass -- I've said so a couple of times already). It's also weird that you change the meaning of maxTapes by reassigning it in advance of iterating up to maxTapes like this.

I should point out again that the master branch mergebatch() function does roughly the same thing as you're doing here, as follows:

static void
mergebatch(Tuplesortstate *state, int64 spacePerTape)
{
    int         srcTape;

    Assert(state->activeTapes > 0);
    Assert(state->tuples);

    /*
     * For the purposes of tuplesort's memory accounting, the batch
     * allocation is special, and regular memory accounting through
     * USEMEM() calls is abandoned (see mergeprereadone()).
     */
    for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
    {
        char       *mergetuples;

        if (!state->mergeactive[srcTape])
            continue;

        /* Allocate buffer for each active tape */
        mergetuples = MemoryContextAllocHuge(state->tuplecontext,
                                             spacePerTape);

        /* Initialize state for tape */
        state->mergetuples[srcTape] = mergetuples;
        state->mergecurrent[srcTape] = mergetuples;
        state->mergetail[srcTape] = mergetuples;
        state->mergeoverflow[srcTape] = NULL;
    }

    state->batchUsed = true;
    state->spacePerTape = spacePerTape;
}

Notably, this function:

* Works without altering the meaning of maxTapes. state->maxTapes is Knuth's T, which is established very early and doesn't change with polyphase merge (the same applies to state->tapeRange). What does change across merge passes is the number of *active* tapes. I don't think it's necessary to change maxTapes in any way, and I find it very confusing that you do. (Also, that you're using state->currentRun here at all seems bad, for more or less the same reason -- that's the number of quicksorted runs.)

* Does allocation for the final merge (that's the only point at which it's called), and so is not based on the number of active tapes that happen to be in play when merging begins at a high level (i.e., when mergeruns() is first called). Many tapes will be totally inactive by the final merge, so this seems completely necessary for multiple merge pass cases.

End of recap. Here is some new information: I was previously confused on this last point, because I thought that logtape.c might be able to do something smart to recycle the memory that is bound per-tape by calls to LogicalTapeAssignReadBufferSize() in your patch. But it doesn't: all the recycling stuff only happens for the much smaller buffers that are juggled and reused to pass tuples back to the caller within tuplesort_gettuple_common(), etc. -- not the logtape.c managed buffers.

So, AFAICT there is no justification for not adopting these notable properties of mergebatch() at some analogous point in this patch. Actually, you should probably not get rid of mergebatch(), but instead call LogicalTapeAssignReadBufferSize() there. This change would mean that the big per-tape memory allocations would happen in the same place as before -- you'd just be asking logtape.c to do it for you, instead of allocating directly.
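Concretely, I'm imagining something of this shape (a hypothetical sketch only, not code from either the patch or master -- it borrows master's state->mergeactive[] bookkeeping and your LogicalTapeAssignReadBufferSize() interface, and assumes availBlocks is the read buffer budget, in blocks):

/*
 * Hypothetical sketch, not from either patch: size read buffers per
 * *active* tape at the start of each merge pass, rather than dividing
 * among all maxTapes once in mergeruns().
 */
static void
assignReadBuffers(Tuplesortstate *state, int64 availBlocks)
{
    int         activeTapes = 0;
    int64       blocksPerTape;
    int64       remainder;
    int         tapenum;

    /* Count the tapes that are active in the merge pass about to begin */
    for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
    {
        if (state->mergeactive[tapenum])
            activeTapes++;
    }
    Assert(activeTapes > 0);

    blocksPerTape = availBlocks / activeTapes;
    remainder = availBlocks % activeTapes;

    /* Divide the budget among active tapes only; inactive tapes get 0 */
    for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
    {
        int64       numBlocks;

        if (!state->mergeactive[tapenum])
            continue;

        numBlocks = blocksPerTape + (remainder-- > 0 ? 1 : 0);
        if (numBlocks > MaxAllocSize / BLCKSZ)
            numBlocks = MaxAllocSize / BLCKSZ;
        LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
                                        numBlocks * BLCKSZ);
    }
}

The point is just that the sizing happens per merge pass, against the tapes that are actually active, in the same spot where mergebatch() does its allocations today.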
Perhaps the practical consequences of not doing something closer to mergebatch() are debatable, but I suspect that there is little point in actually debating it. I think you might as well do it that way. No?

[1] http://mathworld.wolfram.com/Fibonaccin-StepNumber.html

--
Peter Geoghegan