On Mon, Jan 8, 2018 at 11:57 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Ignoring some details around partial vs. non-partial plans, that's
>> pretty much what we ARE doing, but to make it efficient, we sort the
>> paths at plan time so that those choices are easy to make at runtime.
>> If we didn't do that, we could have every worker sort the paths at
>> execution time instead, or have the first process to arrive perform
>> the sort and store the results in shared memory while everyone else
>> waits, but that seems to be more complicated and less efficient, so I
>> don't understand why you're proposing it.
>
> The main bit of info we'd have at runtime that we lack at plan time is
> certainty about the number of available workers. Maybe that doesn't
> really add anything useful to the order in which subplans would be doled
> out; not sure.
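(For concreteness, the plan-time ordering being discussed might look roughly like the toy C sketch below. The SubplanInfo struct, its field names, and the exact ordering rule -- non-partial subplans ahead of partial ones, most expensive first -- are my illustration, not the actual PostgreSQL code; the point is just that once the list is sorted this way, each participant can walk it at runtime instead of sorting anything itself.)

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

/* Toy stand-in for a subplan; the struct and its fields are made up. */
typedef struct SubplanInfo
{
    const char *name;
    double      total_cost;
    bool        is_partial;
} SubplanInfo;

/* Non-partial subplans first; within each group, most expensive first. */
static int
subplan_order_cmp(const void *a, const void *b)
{
    const SubplanInfo *pa = (const SubplanInfo *) a;
    const SubplanInfo *pb = (const SubplanInfo *) b;

    if (pa->is_partial != pb->is_partial)
        return pa->is_partial ? 1 : -1;
    if (pa->total_cost != pb->total_cost)
        return (pa->total_cost > pb->total_cost) ? -1 : 1;
    return 0;
}

int
main(void)
{
    SubplanInfo subplans[] = {
        {"scan_part1", 100.0, false},
        {"scan_part2", 40.0, false},
        {"scan_part3", 60.0, true},
        {"scan_part4", 99.0, false},
    };
    int         n = (int) (sizeof(subplans) / sizeof(subplans[0]));
    int         i;

    /* Sort once, "at plan time"; runtime choices then just scan the list. */
    qsort(subplans, n, sizeof(SubplanInfo), subplan_order_cmp);

    for (i = 0; i < n; i++)
        printf("%s cost=%.0f partial=%d\n",
               subplans[i].name, subplans[i].total_cost,
               (int) subplans[i].is_partial);
    return 0;
}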
The current algorithm doesn't have any use for that information; I'm not sure whether some other algorithm might. If we had perfect information, each participant would always choose the next task in such a way as to minimize overall execution time. Is it possible that the correct choice might depend on how many workers we have in total? /me thinks for a bit. Yes, that's possible. Suppose that the execution times of 5 non-partial plans are 100 99 60 40 1 and that the number of participants is either 2 or 3. Suppose further that the number of rows returned is nominal, so that there's no issue of the worker(s) having to wait. If there are 2 participants, one process should get the plans with execution times 99 and 60 and the other should get the rest. If there are 3 participants, one process should get only 100, the second should get 99 and 1, and the last should get 60 and 40. In the actually implemented algorithm, the leader will grab the cheap plan at the end, the worker(s) will grab the expensive plans at the beginning, and things probably won't work out very nicely. With three participants, the leader will end up with 1 and 40, the first worker with 100 only, and the second worker with 99 and 60. Fail.

The choice of which plan the leader takes might also depend on the number of rows being returned by the various plans. If there's one plan that returns a lot of rows, it would be nice to run that one in the leader, because then we save the cost of transferring those rows between processes. On the other hand, that also risks starving all of the workers; it could be faster to incur the overhead of transferring those rows so that the leader focuses on servicing the tuple queues rather than running plans.

There are also problems with partial plans having high startup costs. In general, if a partial plan has a high startup cost, we'd like the leader not to choose it, because then it's not available to read tuples. So, if we have one plan with a startup cost of 1000 and another plan with a startup cost of 800, then we might think that the leader should prefer the latter. However, if the former plan is a parallel hash join and the latter is a non-parallel-aware join, then the reverse *may* be true, because the startup cost of parallel hash can (mostly) be *shared* among as many participants as we have, whereas non-parallel-aware nodes will *repeat* the startup work. Of course, the plan tree doesn't carry information about whether startup costs of parallel plans are likely to be amortized across workers or repeated in each one, and even if it did, it does not seem trivial to put that information to good use.

Yet another problem with parallel append -- and really, with parallel planning in general -- is that it ignores the cost of resource contention between cooperating backends. If we do a Parallel Append over Parallel Seq Scan operations, spreading workers out figures to win if (1) there are enough of them to generate significant contention on the in-memory data structures, which is especially likely if the data is resident in memory, or if (2) the different partitions are on different filesystems that can satisfy I/O requests in parallel, or even if (3) the different partitions are all on the same filesystem but that filesystem has enough spindles to sequential-scan them all at once. But it could easily lose if, say, the partitions are on different filesystems serviced by a single disk head, and the alternating I/O pattern produces a continuous stream of long seeks. On the third hand, that case could work out too if there's enough per-row processing that we weren't I/O-bound in the first place.
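Returning to the 100 99 60 40 1 example above, here is a quick back-of-the-envelope check -- plain C, purely illustrative, not PostgreSQL code -- that computes each participant's completion time (the sum of the execution times of the plans it runs) and the overall execution time (the maximum over participants) for the three assignments discussed: the best 2-participant split, the best 3-participant split, and the 3-participant outcome described for the implemented algorithm.

#include <stdio.h>

/* Overall execution time: the slowest participant's total work. */
static double
makespan(double groups[][3], int ngroups)
{
    double      worst = 0.0;
    int         i, j;

    for (i = 0; i < ngroups; i++)
    {
        double      sum = 0.0;

        for (j = 0; j < 3; j++)
            sum += groups[i][j];    /* unused slots are 0 */
        if (sum > worst)
            worst = sum;
    }
    return worst;
}

int
main(void)
{
    /* best split over 2 participants: {99,60} and {100,40,1} */
    double      best2[][3] = {{99, 60, 0}, {100, 40, 1}};
    /* best split over 3 participants: {100}, {99,1}, {60,40} */
    double      best3[][3] = {{100, 0, 0}, {99, 1, 0}, {60, 40, 0}};
    /* 3-participant outcome described above: leader {1,40}, workers {100}, {99,60} */
    double      impl3[][3] = {{1, 40, 0}, {100, 0, 0}, {99, 60, 0}};

    printf("2 participants, best split:        %.0f\n", makespan(best2, 2));
    printf("3 participants, best split:        %.0f\n", makespan(best3, 3));
    printf("3 participants, described outcome: %.0f\n", makespan(impl3, 3));
    return 0;
}

It prints 159, 100, and 159: with the described division of labor, adding a third participant buys us nothing over the best 2-participant split.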
I'd be quite happy if somebody wanted to dig into these issues more. I don't pretend that the existing implementation is anything more than a first approximation of what we really want here. It's just based on the observations that (1) in practice, the leader is very often the bottleneck when executing parallel queries and (2) a task that takes a long time to run and can't be sped up by adding workers must get started as soon as possible. I have a feeling that doing this really well is going to require information that we don't have readily available, like which resources each subplan will use at which times and how many CPUs we need to use to render the query I/O-bound. However, there may be clear improvements that can be made with only localized changes, and I welcome ideas.

Query planning appears to be a hard problem. :-p

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company