On Thu, Jan 18, 2018 at 5:49 AM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> I'm mostly away from my computer this week -- sorry about that,
Yeah, seriously. Since when is it OK for hackers to ever be away from
their computers? :-)

> The idea I mentioned would only work if nworkers_launched is never
> over-reported in a scenario that doesn't error out or crash, and never
> under-reported in any scenario. Otherwise static barriers may be even
> less useful than I thought.

I just went back to the thread on "parallel.c oblivion of
worker-startup failures" and refreshed my memory about what's going on
over there. What's going on over there is (1) currently,
nworkers_launched can be over-reported in a scenario that doesn't error
out or crash and (2) I'm proposing to tighten things up so that this is
no longer the case. Amit proposed making it the responsibility of code
that uses parallel.c to cope with nworkers_launched being larger than
the number that actually launched, and my counter-proposal was to make
it reliably ERROR when they don't all launch.

So, thinking about this, my proposal to use dynamic barriers here seems
like it will work regardless of who wins that argument. Your proposal
to use static barriers and decrement the party size based on the number
of participants that fail to start will work if I win that argument,
but not if Amit wins it.

It seems to me in general that dynamic barriers are to be preferred in
almost every circumstance, because static barriers require a longer
chain of assumptions. We can't assume that the number of guests we
invite to the party will match the number that actually show up, so, in
the case of a static barrier, we have to make sure to adjust the party
size if some of the guests end up having to stay home with a sick kid,
or their car breaks down, or they decide to go to the neighbor's party
instead. Absentee guests are not intrinsically a problem, but we have
to account for them in a completely water-tight fashion. On the other
hand, with a dynamic barrier, we don't need to worry about the guests
who don't show up; we only need to worry about the guests who DO show
up. As they come in the door, we count them; as they leave, we count
them again. When the numbers are equal, the party's over. That seems
more robust.

In particular, for parallel query, there is absolutely zero guarantee
that every worker reaches every plan node. For a parallel utility
command, things seem a little better: we can assume that workers are
started only for one particular purpose. But even that might not be
true in the future. For example, imagine a parallel CREATE INDEX on a
partitioned table that cascades to all children. One can easily imagine
wanting to use the same workers for the whole operation and spread them
out across the pool of tasks much as Parallel Append does. There's a
good chance this will be faster than doing each index build in turn
with maximum parallelism. And then the static barrier thing goes right
out the window again, because the number of participants is determined
dynamically.

I really struggle to think of any case where a static barrier is
better. I mean, suppose we have an existing party and then decide to
hold a baking contest. We'll use a barrier to separate the baking phase
from the judging phase. One might think that, since the number of
participants is already decided, someone could initialize the barrier
with that number rather than making everyone attach. But it doesn't
really work, because there's a race: while one process is creating the
barrier with participants = 10, the doctor's beeper goes off and he
leaves the party.
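Concretely, the dynamic pattern is just something like this, using the
Barrier API from storage/barrier.h. (This is only a sketch:
shared->barrier and do_assigned_work() are stand-ins, not anything in
the patch.)

    /* Count ourselves in; BarrierAttach() returns the current phase. */
    int     phase = BarrierAttach(&shared->barrier);

    if (phase == 0)
    {
        /* We arrived in time to help. */
        do_assigned_work();
        /* Count ourselves out without waiting for anyone else. */
        BarrierArriveAndDetach(&shared->barrier);
    }
    else
    {
        /* Too late to help; just count ourselves back out. */
        BarrierDetach(&shared->barrier);
    }

Nobody has to predict how many participants will show up; the barrier
just tracks who actually came in the door and who has left.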
Now there could be some situation in which we are absolutely certain
that we know how many participants we've got and that it won't change,
but I suspect that in almost every scenario, deciding to use a static
barrier is going to be immediately followed by a lot of angst about how
we can be sure that the number of participants will always be correct.

>> Okay. I'll work on adopting dynamic barriers in the way you described.
>> I just wanted to make sure that we're all on the same page about what
>> that looks like.
>
> Looking at Robert's sketch, a few thoughts: (1) it's not OK to attach
> and then just exit, you'll need to detach from the barrier both in the
> case where the worker exits early because the phase is too high and
> the case where you attach in time to help and run to completion;

In the first case, I guess this is because otherwise the other
participants will wait for us even though we're not really there any
more. In the second case, I'm not sure why it matters whether we
detach. If we've reached the highest possible phase number, nobody's
going to wait any more, so who cares? (I mean, apart from tidiness.)

> (2) maybe workers could use BarrierArriveAndDetach() at the end (the
> leader needs to use BarrierArriveAndWait(), but the workers don't
> really need to wait for each other before they exit, do they?);

They don't need to wait for each other, but they do need to wait for
the leader, so I don't think this works. Logically, there are two key
sequencing points. First, the leader needs to wait for the workers to
finish sorting. That's the barrier between phase 0 and phase 1.
Second, the workers need to wait for the leader to absorb their tapes.
That's the barrier between phase 1 and phase 2. If the workers use
BarrierArriveAndWait() to reach phase 1 and then
BarrierArriveAndDetach(), they won't wait for the leader to be done
adopting their tapes as they do in the current patch.

But, hang on a minute. Why do the workers need to wait for the leader
anyway? Can't they just exit once they're done sorting? I think the
original reason why this ended up in the patch is that we needed the
leader to assume ownership of the tapes to avoid having the tapes get
blown away when the worker exits. But, IIUC, with sharedfileset.c,
that problem no longer exists. The tapes are jointly owned by all of
the cooperating backends, and the last backend to detach will remove
them. So, if a worker sorts, advertises that it's done in shared
memory, and exits, then nothing should get removed, and the leader can
adopt the tapes whenever it gets around to it.

If that's correct, then we only need 2 phases, not 3. Workers
BarrierAttach() before reading any data, exiting if the phase is not 0.
Otherwise, they read data and sort it, advertise the final tape in
shared memory, and BarrierArriveAndDetach(). The leader does
BarrierAttach() before launching any workers, then reads data and sorts
it if applicable, then does BarrierArriveAndWait(). When that returns,
all workers are done sorting (and may or may not have finished exiting)
and the leader can take over their tapes, and everything is fine.
That's significantly simpler than my previous outline, and also simpler
than what the patch does today.
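In code form, the two-phase scheme would look roughly like this. (Again
just a sketch against the barrier.h API: the shared struct,
read_and_sort_input(), advertise_final_tape(), adopt_worker_tapes(),
and the wait event name are all made up for illustration.)

    /* Worker: count yourself in before reading any data. */
    if (BarrierAttach(&shared->barrier) != 0)
    {
        /* The sort phase is already over; too late to help. */
        BarrierDetach(&shared->barrier);
        return;
    }
    read_and_sort_input();          /* produce this worker's sorted run */
    advertise_final_tape(shared);   /* publish the tape in shared memory */
    BarrierArriveAndDetach(&shared->barrier);  /* done; no need to wait */

    /* Leader: attach before launching anyone, so phase 0 can't end early. */
    BarrierAttach(&shared->barrier);
    LaunchParallelWorkers(pcxt);
    read_and_sort_input();          /* leader participates, if applicable */
    BarrierArriveAndWait(&shared->barrier, WAIT_EVENT_PARALLEL_SORT);
    /* Every attached worker has finished sorting; adopt their tapes. */
    adopt_worker_tapes(shared);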
> (3) erm, maybe it's a problem that errors occurring in workers while
> the leader is waiting at a barrier won't unblock the leader (we don't
> detach from barriers on abort/exit) -- I'll look into this.

I think if there's an ERROR, the general parallelism machinery is going
to arrange to kill every worker, so nothing matters in that case unless
barrier waits ignore interrupts, which I'm pretty sure they don't.
(Also: if they do, I'll hit the ceiling; that would be awful.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company