On Wed, Sep 14, 2016 at 10:43 AM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
> Addressed all your comments one way or another, new patch attached.
> Comments on some specific points below:

Cool. My response here is written under time pressure, which is not
ideal. I think it's still useful, though.

>> * You should probably point out that typically, access to batch memory
>> will still be sequential, despite your block-based scheme.

> That's not true, the "buffers" in batch memory are not accessed
> sequentially. When we pull the next tuple from a tape, we store it in
> the next free buffer. Usually, that buffer was used to hold the
> previous tuple that was returned from gettuple(), and was just added
> to the free list.
>
> It's still quite cache-friendly, though, because we only need a small
> number of slots (one for each tape).

That's kind of what I meant, I think -- it's more or less sequential.
Especially in the common case where there is only one merge pass.
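To make sure we're picturing the same access pattern, here is a toy
standalone sketch of the slot recycling you describe (the names are
mine, not your patch's). With only one slot per tape, plus one for the
last tuple returned to the caller, the same few buffers get reused
over and over, which is where the cache-friendliness comes from:

    #include <stdio.h>

    #define NTAPES 7            /* one buffer slot per input tape... */
    #define NSLOTS (NTAPES + 1) /* ...plus one for lastReturnedTuple */

    typedef struct Slot
    {
        struct Slot *nextfree;  /* link while on the free list */
        char         tuple[64]; /* stand-in for real tuple data */
    } Slot;

    static Slot  slots[NSLOTS];
    static Slot *freelist;

    static Slot *
    slot_pop(void)
    {
        Slot       *s = freelist;

        freelist = s->nextfree;
        return s;
    }

    static void
    slot_push(Slot *s)
    {
        s->nextfree = freelist;
        freelist = s;
    }

    int
    main(void)
    {
        Slot       *returned;

        for (int i = 0; i < NSLOTS; i++)
            slot_push(&slots[i]);

        /*
         * Hand a tuple to the caller, then recycle its slot on the
         * next call, as with lastReturnedTuple.
         */
        returned = slot_pop();
        slot_push(returned);
        printf("recycled slot %ld\n", (long) (returned - slots));
        return 0;
    }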
> True. I fixed that by putting a MaxAllocSize cap on the buffer size
> instead. I doubt that doing > 1 GB of read-ahead of a single tape
> will do any good.

You may well be right about that, but ideally that could be verified.
I think that the tuplesort is entitled to have whatever memory the
user makes available, unless that memory is almost certainly useless.
It doesn't seem like our job to judge that using extra memory for only
a small expected benefit is always wrong. If the expected benefit is
actually microscopic, or just as often hurts the sort's performance,
then I'd say it's okay to cap it at MaxAllocSize. But it's not obvious
to me that that's true; not yet, anyway.

> Hmm. We don't really need the availMem accounting at all, after we
> have started merging. There is nothing we can do to free memory if we
> run out, and we use fairly little memory anyway. But yes, better safe
> than sorry. I tried to clarify the comments on that.

It is true that we don't really care about the accounting at all. But
the same applies to the existing grow_memtuples() case at the
beginning of merging. The point is, we *do* care about availMem, this
one last time. We must at least produce a sane (i.e. >= 0) number in
any calculation. (I think you understand this already -- just saying.)
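Something like the following toy sketch is all I mean (the function
and parameter names are invented by me, not from your patch;
MaxAllocSize is 1 GB - 1, per memutils.h). Clamp availMem to a sane
value one last time, then divide it evenly among the active tapes
(more on which below), capping each share at what palloc() will accept
anyway:

    #include <stddef.h>
    #include <stdint.h>

    #define MaxAllocSize ((size_t) 0x3fffffff) /* 1 GB - 1, memutils.h */

    static size_t
    per_tape_buffer_size(int64_t availMem, int activeTapes)
    {
        int64_t     share;

        if (availMem < 0)
            availMem = 0;       /* never let the arithmetic go negative */

        share = availMem / (activeTapes > 0 ? activeTapes : 1);

        /*
         * palloc() refuses anything bigger; whether > 1 GB of per-tape
         * read-ahead could ever help is the open question above.
         */
        if (share > (int64_t) MaxAllocSize)
            share = (int64_t) MaxAllocSize;

        return (size_t) share;
    }

With workMem large enough, each active tape then simply gets
min(availMem / activeTapes, MaxAllocSize).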
> OK. I solved that by calling LogicalTapeRewind(), when we're done
> reading a tape. Rewinding a tape has the side-effect of freeing the
> buffer. I was going to put that into mergereadnext(), but it turns
> out that it's tricky to figure out if there are any more runs on the
> same tape, because we have the "actual" tape number there, but the
> tp_runs is indexed by "logical" tape number. So I put the rewind
> calls at the end of mergeruns(), and in TSS_FINALMERGE processing,
> instead. It means that we don't free the buffers quite as early as we
> could, but I think this is good enough.

That seems adequate.

>> Spotted another issue with this code just now. Shouldn't it be based
>> on state->tapeRange? You don't want the destTape to get memory, since
>> you don't use batch memory for tapes that are written to (and final
>> on-the-fly merges don't use their destTape at all).

>> Wait, you're using a local variable maxTapes here, which potentially
>> differs from state->maxTapes:

> I changed that so that it does actually change state->maxTapes. I
> considered having a separate numTapes field, that can be smaller than
> maxTapes, but we don't need the original maxTapes value after that
> point anymore, so it would've been just pro forma to track them
> separately. I hope the comment now explains that better.

I still don't get why you're doing all of this within mergeruns() (the
beginning of when we start merging -- we merge all quicksorted runs),
rather than within beginmerge() (the beginning of one particular merge
pass, of which there are potentially more than one). As runs are
merged in a non-final merge pass, fewer tapes will remain active for
the next merge pass. Doing all of that up-front doesn't work when we
have multiple merge passes, which will happen from time to time.

Correct me if I'm wrong, but I think that you're more skeptical of the
need for polyphase merge than I am. I at least see no reason not to
keep it around. I also think it has some value. It doesn't make this
optimization any harder, really.

> Hmm, yes, using currentRun here is wrong. It needs to be "currentRun
> + 1", because we need one more tape than there are runs, to hold the
> output.

As I said, I think it should be the number of active tapes, as you see
today within beginmerge() + mergebatch(). Why not do it that way? If
you don't get the distinction, see my remarks below on final merges
always using batch memory, even when there are to be multiple merge
passes (there is no reason to add that restriction here). More
generally, I really don't want to mess with the long-standing
definition of maxTapes and things like that, because I see no
advantage.

> Ah, no, the "+ 1" comes from the need to hold the tuple that we last
> returned to the caller in tuplesort_gettuple, until the next call.
> See lastReturnedTuple. I tried to clarify the comments on that.

I see. I don't think that you need to do any of that. Just have the
sizing be based on an even share among activeTapes on final merges
(not necessarily final on-the-fly merges, per my 0002-* patch --
you've done that here, it looks like).

However, it looks like you're doing the wrong thing by only having the
check at the top of beginmerge() -- "if (state->Level == 1)". That
way, you won't use batch memory for the final merge, merely because
there happened to be multiple merge passes. Sure, you shouldn't be
using batch memory for non-final merges (due to their need for an
uneven amount of memory per active tape), and you're not, but the mere
fact that there was a non-final merge should not affect the final
merge at all. That's the kind of thing I'm concerned about when I say
that beginmerge() is the right place for all this new code (not
mergeruns()). Even if you can make it work, it fits a lot better in
beginmerge() -- that's the existing flow of things, which should be
preserved. I don't think you need to examine "state->Level" or
"state->currentRun" at all.

Did you do it this way to make the "replacement selection best case"
stuff within mergeruns() catch on, so that it does the right thing
during its TSS_SORTEDONTAPE processing?

Thanks
--
Peter Geoghegan