On Mon, Jan 15, 2018 at 7:54 PM, Peter Geoghegan <p...@bowt.ie> wrote:
>> BufFileView() looks fairly pointless. It basically creates a copy of
>> the input and, in so doing, destroys the input, which is a lot like
>> returning the input parameter except that it uses more cycles. It
>> does do a few things.
>
> While it certainly did occur to me that that was kind of weird, and I
> struggled with it on my own for a little while, I ultimately agreed
> with Thomas that it added something to have ltsConcatWorkerTapes()
> call some buffile function in every iteration of its loop.
> (BufFileView() + BufFileViewAppend() are code that Thomas actually
> wrote, though I added the asserts and comments myself.)
Hmm, well, if Thomas contributed code to this patch, then he needs to be listed as an author. I went searching for an email on this thread (or any other) where he posted code for this, thinking that there might be some discussion explaining the motivation, but I didn't find any. I'm still in favor of erasing this distinction.

> If you think about this in terms of the interface rather than the
> implementation, then it may make more sense. The encapsulation adds
> something which might pay off later, such as when extendBufFile()
> needs to work with a concatenated set of BufFiles. And even right now,
> I cannot simply reuse the BufFile without then losing the assert that
> is currently in BufFileViewAppend() (must not have associated shared
> fileset assert). So I'd end up asserting less (rather than more) there
> if BufFileView() was removed.

I would see the encapsulation as having some value if the original BufFile remained valid and the new view were also valid. Then the BufFileView operation is a bit like a copy-on-write filesystem snapshot: you have the original, which you can do stuff with, and you have a copy, which can be manipulated independently, but the copying is cheap. But here the BufFile gobbles up the original, so I don't see the point.

The Assert(target->fileset == NULL) that would be lost in BufFileViewAppend has no value anyway, AFAICS. There is also Assert(source->readOnly), given which the presence or absence of the fileset makes no difference. And if, as you say, extendBufFile were eventually made to work here, this Assert would presumably get removed anyway; I think we'd likely want the additional files to get associated with the shared file set rather than being locally temporary files.

> It wastes some cycles to not simply use the BufFile directly, but not
> terribly many in the grand scheme of things. This happens once per
> external sort operation.

I'm not at all concerned about the loss of cycles. I'm concerned about making the mechanism more complicated to understand and maintain for future readers of the code. When experienced hackers see code that doesn't seem to accomplish anything, they (or at least I) tend to assume that there must be a hidden reason for it to be there and spend time trying to figure out what it is. If there actually is no hidden purpose, then that study is a waste of time, and we can spare them the trouble by getting rid of it now.

>> In miscadmin.h, I'd put the prototype for the new GUC next to
>> max_worker_processes, not maintenance_work_mem.
>
> But then I'd really have to put it next to max_worker_processes in
> globals.c, too. That would mean that it would go under "Primary
> determinants of sizes of shared-memory structures" within globals.c,
> which seems wrong to me. What do you think?

OK, that's a fair point.

> I think I still get the gist of what you're saying, though. I've come
> up with a new structure that is a noticeable improvement on what I
> had. Importantly, the new structure let me add a number of
> parallelism-agnostic asserts that make sure that every ambuild routine
> that supports parallelism gets the details right.

Yes, that looks better. I'm slightly dubious that the new Asserts() are worthwhile, but I guess it's OK. But I think it would be better to ditch the if-statement and do it like this:

Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
Assert(snapshot == SnapshotAny ?
       TransactionIdIsValid(OldestXmin) :
       !TransactionIdIsValid(OldestXmin));
Assert(snapshot == SnapshotAny || !anyvisible);

Also, I think you've got a little more than you need in terms of comments. I would keep the comments for the serial case and parallel case and drop the earlier one that basically says the same thing:

+ * (Note that parallel case never has us register/unregister snapshot, and
+ * provides appropriate snapshot for us.)

> There is a little bit of ambiguity about other cases, though -- that's
> the secondary point I tried to make within that comment block, and the
> part that you took issue with. To put this secondary point another
> way: It's possible that we'd fail to detect it if someone's comparator
> went bananas and decided it was okay to do SQL access (that resulted
> in an index scan of the index undergoing reindex). That does seem
> rather unlikely, but I felt it necessary to say something like this
> because ReindexIsProcessingIndex() isn't already something that only
> deals with catalog indexes -- it works with all indexes.

I agree that it isn't particularly likely, but if somebody found it worthwhile to insert guards against those cases, maybe we should preserve them instead of abandoning them. It shouldn't be that hard to propagate those values from the leader to the workers. The main difficulty there seems to be that we're creating the parallel context in nbtsort.c, while the state that would need to be propagated is private to index.c, but there are several ways to solve that problem. It looks to me like the most robust approach would be to just make that part of what parallel.c naturally does. Patch for that attached.

> Here's what I've done based on your feedback: I've changed the header
> comments, but stopped leaderParticipates from affecting the
> compute_parallel_worker() calculation (so, as I said,
> leaderParticipates stays). The leaderParticipates argument continues
> to affect these two aspects of plan_create_index_workers()'s return
> value:
>
> 1. It continues to be used so we have a total number of participants
> (not workers) to apply our must-have-32MB-workMem limit on
> participants.
>
> Parallel query has no equivalent of this, and it seems warranted. Note
> that this limit is no longer applied when the parallel_workers storage
> param was set, as discussed.
>
> 2. I continue to use the leaderParticipates argument to disallow the
> case where there is only one CREATE INDEX participant but parallelism
> is in use, because, of course, that clearly makes no sense -- we
> should just use a serial sort instead.

That's an improvement, but see below.

> (It might make sense to allow this if parallel_leader_participation
> was *purely* a testing GUC, only for use by backend hackers, but
> AFAICT it isn't.)

As applied to parallel CREATE INDEX, it pretty much is just a testing GUC, which is why I was skeptical about leaving support for it in the patch. There's no anticipated advantage to having the leader not participate -- unlike for parallel queries, where it is quite possible that setting parallel_leader_participation=off could be a win, even generally. If you just have a Gather over a parallel sequential scan, it is unlikely that parallel_leader_participation=off will help; it will most likely hurt, at least up to the point where more participants become a bad idea in general due to contention.
However, if you have a complex plan involving fairly large operations that cannot be divided up among workers, such as a Parallel Append, a Hash Join with a big startup cost, a Sort that happens in the worker, or even a parallel Index Scan that takes a long time to advance to the next page because it has to do I/O, you might leave workers idling while the leader is trying to "help". Some users may have workloads where this is the normal case. Ideally, the planner would figure out whether this is likely and tell the leader whether or not to participate, but we don't know how to figure that out yet. On the other hand, for CREATE INDEX, having the leader not participate can't really improve anything.

In other words, right now, parallel_leader_participation is not strictly a testing GUC, but if we make CREATE INDEX respect it, then we're pushing it towards being a GUC that you don't ever want to enable except for testing. I'm still not sure that's a very good idea, but if we're going to do it, then surely we should be consistent. It's true that having one worker and no parallel leader participation can never be better than just having the leader do it, but it is also true that having two workers and no parallel leader participation can never be better than having one worker with leader participation. I don't see a reason to treat those cases differently.

If we're going to keep parallel_leader_participation support here, I think the last hunk in config.sgml should read more like this:

    Allows the leader process to execute the query plan under
    <literal>Gather</literal> and <literal>Gather Merge</literal> nodes
    and to participate in parallel index builds.  The default is
    <literal>on</literal>.  For queries, setting this value to
    <literal>off</literal> reduces the likelihood that workers will become
    blocked because the leader is not reading tuples fast enough, but
    requires the leader process to wait for worker processes to start up
    before the first tuples can be produced.  The degree to which the
    leader can help or hinder performance depends on the plan type or
    index build strategy, number of workers and query duration.  For
    index builds, setting this value to <literal>off</literal> is expected
    to reduce performance, but may be useful for testing purposes.

> I suspect that the parameters of any cost model for parallel CREATE
> INDEX that we're prepared to consider for v11 are: "Use a number of
> parallel workers that is one below the number at which the total
> duration of the CREATE INDEX either stays the same or goes up".

That's pretty much the definition of a correct cost model; the trick is how to implement it without an oracle.

> BTW, the 32MB per participant limit within plan_create_index_workers()
> was chosen based on the realization that any higher value would make
> having a default setting of 2 for max_parallel_maintenance_workers (to
> match the max_parallel_workers_per_gather default) pointless when the
> default maintenance_work_mem value of 64MB is in use. That's not
> terribly scientific, though it at least doesn't come at the expense of
> a more scientific idea for a limit like that (I don't actually have
> one, you see). I am merely trying to avoid being *gratuitously*
> wasteful of shared resources that are difficult to accurately cost in
> (e.g., the distributed cost of random I/O to the system as a whole
> when we do a parallel index build while ridiculously low on
> maintenance_work_mem).

I see. I think it's a good start.
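For anyone skimming the thread, the kind of cap being talked about boils down to something like the sketch below. To be clear, this is an illustration made up for this discussion, not the actual plan_create_index_workers() code from the patch, and the function and variable names here are invented:

/*
 * Illustrative sketch only: scale the worker count back until every
 * participant -- the workers plus the leader, when it participates --
 * gets at least a 32MB share of maintenance_work_mem (expressed in
 * kilobytes, as the GUC is).
 */
static int
cap_workers_by_workmem(int nworkers, bool leaderParticipates,
                       long maintenance_work_mem_kb)
{
    int         participants = nworkers + (leaderParticipates ? 1 : 0);
    const long  min_kb_per_participant = 32L * 1024;    /* 32MB floor */

    while (participants > 1 &&
           maintenance_work_mem_kb / participants < min_kb_per_participant)
        participants--;

    /*
     * If only one participant remains, parallelism buys nothing and the
     * caller would fall back to a serial sort.
     */
    return participants - (leaderParticipates ? 1 : 0);
}

Under that kind of rule, with the leader participating and the default maintenance_work_mem of 64MB, a floor any higher than 32MB would prevent even a single worker from launching, which seems to be the point of the paragraph quoted above.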
I wonder in general whether it's better to add memory or add workers. In other words, suppose I have a busy system where my index builds are slow. Should I try to free up some memory so that I can raise maintenance_work_mem, or should I try to free up some CPU resources so I can raise max_parallel_maintenance_workers? The answer doubtless depends on the current values that I have configured for those settings and the type of data that I'm indexing, as well as how much memory I could free up how easily and how much CPU I could free up how easily. But I wish I understood better than I do which one was more likely to help in a given situation.

I also wonder what the next steps would be to make this whole thing scale better. From the performance tests that have been performed so far, it seems like adding a modest number of workers definitely helps, but it tops out around 2-3x with 4-8 workers. I understand from your previous comments that's typical of other databases. It also seems pretty clear that more memory helps, but only to a point. For instance, I just tried "create index x on pgbench_accounts (aid)" without your patch at scale factor 1000. With maintenance_work_mem = 1MB, it generated 6689 runs and took 131 seconds. With maintenance_work_mem = 64MB, it took 67 seconds. With maintenance_work_mem = 1GB, it took 60 seconds. More memory didn't help, even if the sort could be made entirely internal. This seems to be a fairly typical pattern: using enough memory can buy you a small multiple, using a bunch of workers can buy you a small multiple, but then it just doesn't get faster. Yet, in theory, it seems like if we're willing to provide essentially unlimited memory and CPU resources, we ought to be able to make this go almost arbitrarily fast.

>> I think that the naming of the wait events could be improved. Right
>> now, they are named by which kind of process does the waiting, but
>> they really should be named based on the thing for which we're
>> waiting. I also suggest that we could just write Sort instead of
>> Tuplesort. In short, I suggest ParallelTuplesortLeader ->
>> ParallelSortWorkersDone and ParallelTuplesortWorker ->
>> ParallelSortTapeHandover.
>
> WFM. Also added documentation for the wait events to monitoring.sgml,
> which I somehow missed the first time around.

But you forgot to update the preceding "morerows" line, so the formatting will be all messed up.

>> + * Make sure that the temp file(s) underlying the tape set are created in
>> + * suitable temp tablespaces. This is only really needed for serial
>> + * sorts.
>>
>> This comment makes me wonder whether it is "sorta" needed for parallel
>> sorts.
>
> I removed "really". The point of the comment is that we've already set
> up temp tablespaces for the shared fileset in the parallel case.
> Shared filesets figure out which tablespaces will be used up-front --
> see SharedFileSetInit().

So why not say it that way? i.e.

For parallel sorts, this should have been done already, but it doesn't
matter if it gets done twice.

> I updated tuplesort_end() so that trace_sort reports on the end of the
> sort, even for worker processes. (We still don't show generic
> tuplesort_begin* message for workers, though.)

I don't see any reason not to make those contingent only on trace_sort. The user can puzzle apart which messages are which from the PIDs in the logfile.
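To be concrete about the trace_sort point: the begin messages in tuplesort.c are already guarded roughly like this (paraphrasing the existing pattern from memory, so treat the details as illustrative rather than exact), and I would just keep trace_sort as the only condition instead of additionally testing whether we are running in a worker:

#ifdef TRACE_SORT
    if (trace_sort)
        elog(LOG,
             "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c",
             nkeys, workMem, randomAccess ? 't' : 'f');
#endif

(Here nkeys, workMem, and randomAccess are just the arguments of tuplesort_begin_heap(); the point is that nothing other than trace_sort decides whether the message gets logged.)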
>> Instead of making a special case in CreateParallelContext for
>> serializable_okay, maybe index_build should just use SetConfigOption()
>> to force the isolation level to READ COMMITTED right after it does
>> NewGUCNestLevel(). The change would only be temporary because the
>> subsequent call to AtEOXact_GUC() will revert it.
>
> I tried doing it that way, but it doesn't seem workable:
>
> postgres=# begin transaction isolation level serializable ;
> BEGIN
> postgres=*# reindex index test_unique;
> ERROR: 25001: SET TRANSACTION ISOLATION LEVEL must be called before any query
> LOCATION: call_string_check_hook, guc.c:9953

Bummer.

> Instead, I added a bit more to comments within
> CreateParallelContext(), to justify what I've done along the lines you
> went into.

Yeah, that seems OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
propagate-reindex-state-v1.patch