On Mon, Jan 8, 2018 at 11:57 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Ignoring some details around partial vs. non-partial plans, that's
>> pretty much what we ARE doing, but to make it efficient, we sort the
>> paths at plan time so that those choices are easy to make at runtime.
>> If we didn't do that, we could have every worker sort the paths at
>> execution time instead, or have the first process to arrive perform
>> the sort and store the results in shared memory while everyone else
>> waits, but that seems to be more complicated and less efficient, so I
>> don't understand why you're proposing it.
>
> The main bit of info we'd have at runtime that we lack at plan time is
> certainty about the number of available workers. Maybe that doesn't
> really add anything useful to the order in which subplans would be doled
> out; not sure.
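(For concreteness, the plan-time ordering being discussed might look roughly like the toy C sketch below. The SubplanInfo struct, its field names, and the exact ordering rule -- non-partial subplans ahead of partial ones, most expensive first -- are my illustration, not the actual PostgreSQL code; the point is just that once the list is sorted this way, each participant can walk it at runtime instead of sorting anything itself.)

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

/* Toy stand-in for a subplan; the struct and its fields are made up. */
typedef struct SubplanInfo
{
    const char *name;
    double      total_cost;
    bool        is_partial;
} SubplanInfo;

/* Non-partial subplans first; within each group, most expensive first. */
static int
subplan_order_cmp(const void *a, const void *b)
{
    const SubplanInfo *pa = (const SubplanInfo *) a;
    const SubplanInfo *pb = (const SubplanInfo *) b;

    if (pa->is_partial != pb->is_partial)
        return pa->is_partial ? 1 : -1;
    if (pa->total_cost != pb->total_cost)
        return (pa->total_cost > pb->total_cost) ? -1 : 1;
    return 0;
}

int
main(void)
{
    SubplanInfo subplans[] = {
        {"scan_part1", 100.0, false},
        {"scan_part2", 40.0, false},
        {"scan_part3", 60.0, true},
        {"scan_part4", 99.0, false},
    };
    int         n = (int) (sizeof(subplans) / sizeof(subplans[0]));
    int         i;

    /* Sort once, "at plan time"; runtime choices then just scan the list. */
    qsort(subplans, n, sizeof(SubplanInfo), subplan_order_cmp);

    for (i = 0; i < n; i++)
        printf("%s cost=%.0f partial=%d\n",
               subplans[i].name, subplans[i].total_cost,
               (int) subplans[i].is_partial);
    return 0;
}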
The current algorithm doesn't have any use for that information; I'm not sure whether some other algorithm might. If we had perfect information, each participant would always choose the next task in such a way as to minimize overall execution time. Is it possible that the correct choice might depend on how many workers we have in total? /me thinks for a bit. Yes, that's possible. Suppose that the execution times of 5 non-partial plans are 100 99 60 40 1 and that the number of participants is either 2 or 3. Suppose further that the number of rows returned is nominal, so that there's no issue of the worker(s) having to wait. If there are 2 participants, one process should get the plans with execution times 99 and 60 and the other should get the rest. If there are 3 participants, one process should get only 100, the second should get 99 and 1, and the last should get 60 and 40. In the actually implemented algorithm, the leader will grab the cheap plan at the end, the worker(s) will grab the expensive plans at the beginning, and things probably won't work out very nicely. With three participants, the leader will end up with 1 and 40, the first worker with 100 only, and the second worker with 99 and 60. Fail.

The choice of which plan the leader takes might also depend on the number of rows being returned by the various plans. If there's one plan that returns a lot of rows, it would be nice to run that one in the leader, because then we save the cost of transferring those rows between processes. On the other hand, that also risks starving all of the workers; it could be faster to incur the overhead of transferring those rows so that the leader focuses on servicing the tuple queues rather than running plans.

There are also problems with partial plans having high startup costs. In general, if a partial plan has a high startup cost, we'd like the leader not to choose it, because then it's not available to read tuples. So, if we have one plan with a startup cost of 1000 and another plan with a startup cost of 800, then we might think that the leader should prefer the latter. However, if the former plan is a parallel hash join and the latter is a non-parallel-aware join, then the reverse *may* be true, because the startup cost of parallel hash can (mostly) be *shared* among as many participants as we have, whereas non-parallel-aware nodes will *repeat* the startup work. Of course, the plan tree doesn't carry information about whether startup costs of parallel plans are likely to be amortized across workers or repeated in each one, and even if it did, it does not seem trivial to put that information to good use.

Yet another problem with parallel append -- and really, with parallel planning in general -- is that it ignores the cost of resource contention between cooperating backends. If we do a Parallel Append over Parallel Seq Scan operations, spreading workers out figures to win if (1) there are enough of them to generate significant contention on the in-memory data structures, which is especially likely if the data is resident in memory, or if (2) the different partitions are on different filesystems that can satisfy I/O requests in parallel, or even if (3) the different partitions are all on the same filesystem but that filesystem has enough spindles to sequential-scan them all at once. But it could easily lose if, say, the partitions are on different filesystems serviced by a single disk head, and the alternating I/O pattern produces a continuous stream of long seeks. On the third hand, that case could work out too if there's enough per-row processing that we weren't I/O-bound in the first place.
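Returning to the 100 99 60 40 1 example above, here is a quick back-of-the-envelope check -- plain C, purely illustrative, not PostgreSQL code -- that computes each participant's completion time (the sum of the execution times of the plans it runs) and the overall execution time (the maximum over participants) for the three assignments discussed: the best 2-participant split, the best 3-participant split, and the 3-participant outcome described for the implemented algorithm.

#include <stdio.h>

/* Overall execution time: the slowest participant's total work. */
static double
makespan(double groups[][3], int ngroups)
{
    double      worst = 0.0;
    int         i, j;

    for (i = 0; i < ngroups; i++)
    {
        double      sum = 0.0;

        for (j = 0; j < 3; j++)
            sum += groups[i][j];    /* unused slots are 0 */
        if (sum > worst)
            worst = sum;
    }
    return worst;
}

int
main(void)
{
    /* best split over 2 participants: {99,60} and {100,40,1} */
    double      best2[][3] = {{99, 60, 0}, {100, 40, 1}};
    /* best split over 3 participants: {100}, {99,1}, {60,40} */
    double      best3[][3] = {{100, 0, 0}, {99, 1, 0}, {60, 40, 0}};
    /* 3-participant outcome described above: leader {1,40}, workers {100}, {99,60} */
    double      impl3[][3] = {{1, 40, 0}, {100, 0, 0}, {99, 60, 0}};

    printf("2 participants, best split:        %.0f\n", makespan(best2, 2));
    printf("3 participants, best split:        %.0f\n", makespan(best3, 3));
    printf("3 participants, described outcome: %.0f\n", makespan(impl3, 3));
    return 0;
}

It prints 159, 100, and 159: with the described division of labor, adding a third participant buys us nothing over the best 2-participant split.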
I'd be quite happy if somebody wanted to dig into these issues more. I don't pretend that the existing implementation is anything more than a first approximation of what we really want here. It's just based on the observations that (1) in practice, the leader is very often the bottleneck when executing parallel queries and (2) a task that takes a long time to run and can't be sped up by adding workers must get started as soon as possible. I have a feeling that doing this really well is going to require information that we don't have readily available, like which resources each subplan will use at which times and how many CPUs we need to use to render the query I/O-bound. However, there may be clear improvements that can be made with only localized changes, and I welcome ideas.

Query planning appears to be a hard problem. :-p

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company