On Mon, Jan 15, 2018 at 7:54 PM, Peter Geoghegan <p...@bowt.ie> wrote:
>> BufFileView() looks fairly pointless. It basically creates a copy of
>> the input and, in so doing, destroys the input, which is a lot like
>> returning the input parameter except that it uses more cycles. It
>> does do a few things.
>
> While it certainly did occur to me that that was kind of weird, and I
> struggled with it on my own for a little while, I ultimately agreed
> with Thomas that it added something to have ltsConcatWorkerTapes()
> call some buffile function in every iteration of its loop.
> (BufFileView() + BufFileViewAppend() are code that Thomas actually
> wrote, though I added the asserts and comments myself.)
Hmm, well, if Thomas contributed code to this patch, then he needs to be listed as an author. I went searching for an email on this thread (or any other) where he posted code for this, thinking that there might be some discussion explaining the motivation, but I didn't find any. I'm still in favor of erasing this distinction.

> If you think about this in terms of the interface rather than the
> implementation, then it may make more sense. The encapsulation adds
> something which might pay off later, such as when extendBufFile()
> needs to work with a concatenated set of BufFiles. And even right now,
> I cannot simply reuse the BufFile without then losing the assert that
> is currently in BufFileViewAppend() (must not have associated shared
> fileset assert). So I'd end up asserting less (rather than more) there
> if BufFileView() was removed.

I would see the encapsulation as having some value if the original BufFile remained valid and the new view were also valid. Then the BufFileView operation is a bit like a copy-on-write filesystem snapshot: you have the original, which you can do stuff with, and you have a copy, which can be manipulated independently, but the copying is cheap. But here the BufFile gobbles up the original, so I don't see the point.

The Assert(target->fileset == NULL) that would be lost in BufFileViewAppend has no value anyway, AFAICS. There is also Assert(source->readOnly), given which the presence or absence of the fileset makes no difference. And if, as you say, extendBufFile were eventually made to work here, this Assert would presumably get removed anyway; I think we'd likely want the additional files to get associated with the shared file set rather than being locally temporary files.

> It wastes some cycles to not simply use the BufFile directly, but not
> terribly many in the grand scheme of things. This happens once per
> external sort operation.

I'm not at all concerned about the loss of cycles. I'm concerned about making the mechanism more complicated to understand and maintain for future readers of the code. When experienced hackers see code that doesn't seem to accomplish anything, they (or at least I) tend to assume that there must be a hidden reason for it to be there and spend time trying to figure out what it is. If there actually is no hidden purpose, then that study is a waste of time, and we can spare them the trouble by getting rid of it now.

>> In miscadmin.h, I'd put the prototype for the new GUC next to
>> max_worker_processes, not maintenance_work_mem.
>
> But then I'd really have to put it next to max_worker_processes in
> globals.c, too. That would mean that it would go under "Primary
> determinants of sizes of shared-memory structures" within globals.c,
> which seems wrong to me. What do you think?

OK, that's a fair point.

> I think I still get the gist of what you're saying, though. I've come
> up with a new structure that is a noticeable improvement on what I
> had. Importantly, the new structure let me add a number of
> parallelism-agnostic asserts that make sure that every ambuild routine
> that supports parallelism gets the details right.

Yes, that looks better. I'm slightly dubious that the new Asserts() are worthwhile, but I guess it's OK. But I think it would be better to ditch the if-statement and do it like this:

Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
Assert(snapshot == SnapshotAny ?
       TransactionIdIsValid(OldestXmin) :
       !TransactionIdIsValid(OldestXmin));
Assert(snapshot == SnapshotAny || !anyvisible);

Also, I think you've got a little more than you need in terms of comments. I would keep the comments for the serial case and parallel case and drop the earlier one that basically says the same thing:

+ * (Note that parallel case never has us register/unregister snapshot, and
+ * provides appropriate snapshot for us.)

> There is a little bit of ambiguity about other cases, though -- that's
> the secondary point I tried to make within that comment block, and the
> part that you took issue with. To put this secondary point another
> way: It's possible that we'd fail to detect it if someone's comparator
> went bananas and decided it was okay to do SQL access (that resulted
> in an index scan of the index undergoing reindex). That does seem
> rather unlikely, but I felt it necessary to say something like this
> because ReindexIsProcessingIndex() isn't already something that only
> deals with catalog indexes -- it works with all indexes.

I agree that it isn't particularly likely, but if somebody found it worthwhile to insert guards against those cases, maybe we should preserve them instead of abandoning them. It shouldn't be that hard to propagate those values from the leader to the workers. The main difficulty there seems to be that we're creating the parallel context in nbtsort.c, while the state that would need to be propagated is private to index.c, but there are several ways to solve that problem. It looks to me like the most robust approach would be to just make that part of what parallel.c naturally does. Patch for that attached.

> Here's what I've done based on your feedback: I've changed the header
> comments, but stopped leaderParticipates from affecting the
> compute_parallel_worker() calculation (so, as I said,
> leaderParticipates stays). The leaderParticipates argument continues
> to affect these two aspects of plan_create_index_workers()'s return
> value:
>
> 1. It continues to be used so we have a total number of participants
> (not workers) to apply our must-have-32MB-workMem limit on
> participants.
>
> Parallel query has no equivalent of this, and it seems warranted. Note
> that this limit is no longer applied when the parallel_workers storage
> param was set, as discussed.
>
> 2. I continue to use the leaderParticipates argument to disallow the
> case where there is only one CREATE INDEX participant but parallelism
> is in use, because, of course, that clearly makes no sense -- we
> should just use a serial sort instead.

That's an improvement, but see below.

> (It might make sense to allow this if parallel_leader_participation
> was *purely* a testing GUC, only for use by backend hackers, but
> AFAICT it isn't.)

As applied to parallel CREATE INDEX, it pretty much is just a testing GUC, which is why I was skeptical about leaving support for it in the patch. There's no anticipated advantage to having the leader not participate -- unlike for parallel queries, where it is quite possible that setting parallel_leader_participation=off could be a win, even generally. If you just have a Gather over a parallel sequential scan, it is unlikely that parallel_leader_participation=off will help; it will most likely hurt, at least up to the point where more participants become a bad idea in general due to contention.
However, if you have a complex plan involving fairly large operations that cannot be divided up among workers, such as a Parallel Append, a Hash Join with a big startup cost, a Sort that happens in the worker, or even a parallel Index Scan that takes a long time to advance to the next page because it has to do I/O, you might leave workers idling while the leader is trying to "help". Some users may have workloads where this is the normal case. Ideally, the planner would figure out whether this is likely and tell the leader whether or not to participate, but we don't know how to figure that out yet. On the other hand, for CREATE INDEX, having the leader not participate can't really improve anything.

In other words, right now, parallel_leader_participation is not strictly a testing GUC, but if we make CREATE INDEX respect it, then we're pushing it towards being a GUC that you don't ever want to enable except for testing. I'm still not sure that's a very good idea, but if we're going to do it, then surely we should be consistent. It's true that having one worker and no parallel leader participation can never be better than just having the leader do it, but it is also true that having two workers and no parallel leader participation can never be better than having one worker with leader participation. I don't see a reason to treat those cases differently.

If we're going to keep parallel_leader_participation support here, I think the last hunk in config.sgml should read more like this:

    Allows the leader process to execute the query plan under
    <literal>Gather</literal> and <literal>Gather Merge</literal> nodes
    and to participate in parallel index builds.  The default is
    <literal>on</literal>.  For queries, setting this value to
    <literal>off</literal> reduces the likelihood that workers will become
    blocked because the leader is not reading tuples fast enough, but
    requires the leader process to wait for worker processes to start up
    before the first tuples can be produced.  The degree to which the
    leader can help or hinder performance depends on the plan type or
    index build strategy, number of workers and query duration.  For
    index builds, setting this value to <literal>off</literal> is expected
    to reduce performance, but may be useful for testing purposes.

> I suspect that the parameters of any cost model for parallel CREATE
> INDEX that we're prepared to consider for v11 are: "Use a number of
> parallel workers that is one below the number at which the total
> duration of the CREATE INDEX either stays the same or goes up".

That's pretty much the definition of a correct cost model; the trick is how to implement it without an oracle.

> BTW, the 32MB per participant limit within plan_create_index_workers()
> was chosen based on the realization that any higher value would make
> having a default setting of 2 for max_parallel_maintenance_workers (to
> match the max_parallel_workers_per_gather default) pointless when the
> default maintenance_work_mem value of 64MB is in use. That's not
> terribly scientific, though it at least doesn't come at the expense of
> a more scientific idea for a limit like that (I don't actually have
> one, you see). I am merely trying to avoid being *gratuitously*
> wasteful of shared resources that are difficult to accurately cost in
> (e.g., the distributed cost of random I/O to the system as a whole
> when we do a parallel index build while ridiculously low on
> maintenance_work_mem).

I see. I think it's a good start.
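For anyone skimming the thread, the kind of cap being talked about boils down to something like the sketch below. To be clear, this is an illustration made up for this discussion, not the actual plan_create_index_workers() code from the patch, and the function and variable names here are invented:

/*
 * Illustrative sketch only: scale the worker count back until every
 * participant -- the workers plus the leader, when it participates --
 * gets at least a 32MB share of maintenance_work_mem (expressed in
 * kilobytes, as the GUC is).
 */
static int
cap_workers_by_workmem(int nworkers, bool leaderParticipates,
                       long maintenance_work_mem_kb)
{
    int         participants = nworkers + (leaderParticipates ? 1 : 0);
    const long  min_kb_per_participant = 32L * 1024;    /* 32MB floor */

    while (participants > 1 &&
           maintenance_work_mem_kb / participants < min_kb_per_participant)
        participants--;

    /*
     * If only one participant remains, parallelism buys nothing and the
     * caller would fall back to a serial sort.
     */
    return participants - (leaderParticipates ? 1 : 0);
}

Under that kind of rule, with the leader participating and the default maintenance_work_mem of 64MB, a floor any higher than 32MB would prevent even a single worker from launching, which seems to be the point of the paragraph quoted above.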
I wonder in general whether it's better to add memory or add workers. In other words, suppose I have a busy system where my index builds are slow. Should I try to free up some memory so that I can raise maintenance_work_mem, or should I try to free up some CPU resources so I can raise max_parallel_maintenance_workers? The answer doubtless depends on the current values that I have configured for those settings and the type of data that I'm indexing, as well as how much memory I could free up how easily and how much CPU I could free up how easily. But I wish I understood better than I do which one was more likely to help in a given situation.

I also wonder what the next steps would be to make this whole thing scale better. From the performance tests that have been performed so far, it seems like adding a modest number of workers definitely helps, but it tops out around 2-3x with 4-8 workers. I understand from your previous comments that's typical of other databases. It also seems pretty clear that more memory helps, but only to a point. For instance, I just tried "create index x on pgbench_accounts (aid)" without your patch at scale factor 1000. With maintenance_work_mem = 1MB, it generated 6689 runs and took 131 seconds. With maintenance_work_mem = 64MB, it took 67 seconds. With maintenance_work_mem = 1GB, it took 60 seconds. More memory didn't help, even if the sort could be made entirely internal. This seems to be a fairly typical pattern: using enough memory can buy you a small multiple, using a bunch of workers can buy you a small multiple, but then it just doesn't get faster. Yet, in theory, it seems like if we're willing to provide essentially unlimited memory and CPU resources, we ought to be able to make this go almost arbitrarily fast.

>> I think that the naming of the wait events could be improved. Right
>> now, they are named by which kind of process does the waiting, but
>> they really should be named based on the thing for which we're
>> waiting. I also suggest that we could just write Sort instead of
>> Tuplesort. In short, I suggest ParallelTuplesortLeader ->
>> ParallelSortWorkersDone and ParallelTuplesortWorker ->
>> ParallelSortTapeHandover.
>
> WFM. Also added documentation for the wait events to monitoring.sgml,
> which I somehow missed the first time around.

But you forgot to update the preceding "morerows" line, so the formatting will be all messed up.

>> + * Make sure that the temp file(s) underlying the tape set are created in
>> + * suitable temp tablespaces. This is only really needed for serial
>> + * sorts.
>>
>> This comment makes me wonder whether it is "sorta" needed for parallel
>> sorts.
>
> I removed "really". The point of the comment is that we've already set
> up temp tablespaces for the shared fileset in the parallel case.
> Shared filesets figure out which tablespaces will be used up-front --
> see SharedFileSetInit().

So why not say it that way? i.e.

For parallel sorts, this should have been done already, but it doesn't
matter if it gets done twice.

> I updated tuplesort_end() so that trace_sort reports on the end of the
> sort, even for worker processes. (We still don't show generic
> tuplesort_begin* message for workers, though.)

I don't see any reason not to make those contingent only on trace_sort. The user can puzzle apart which messages are which from the PIDs in the logfile.
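To be concrete about the trace_sort point: the begin messages in tuplesort.c are already guarded roughly like this (paraphrasing the existing pattern from memory, so treat the details as illustrative rather than exact), and I would just keep trace_sort as the only condition instead of additionally testing whether we are running in a worker:

#ifdef TRACE_SORT
    if (trace_sort)
        elog(LOG,
             "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c",
             nkeys, workMem, randomAccess ? 't' : 'f');
#endif

(Here nkeys, workMem, and randomAccess are just the arguments of tuplesort_begin_heap(); the point is that nothing other than trace_sort decides whether the message gets logged.)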
>> Instead of making a special case in CreateParallelContext for
>> serializable_okay, maybe index_build should just use SetConfigOption()
>> to force the isolation level to READ COMMITTED right after it does
>> NewGUCNestLevel(). The change would only be temporary because the
>> subsequent call to AtEOXact_GUC() will revert it.
>
> I tried doing it that way, but it doesn't seem workable:
>
> postgres=# begin transaction isolation level serializable ;
> BEGIN
> postgres=*# reindex index test_unique;
> ERROR: 25001: SET TRANSACTION ISOLATION LEVEL must be called before any query
> LOCATION: call_string_check_hook, guc.c:9953

Bummer.

> Instead, I added a bit more to comments within
> CreateParallelContext(), to justify what I've done along the lines you
> went into.

Yeah, that seems OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
propagate-reindex-state-v1.patch