On Thu, Jan 18, 2018 at 5:49 AM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> I'm mostly away from my computer this week -- sorry about that,
Yeah, seriously. Since when is it OK for hackers to ever be away from
their computers? :-)

> The idea I mentioned would only work if nworkers_launched is never
> over-reported in a scenario that doesn't error out or crash, and never
> under-reported in any scenario. Otherwise static barriers may be even
> less useful than I thought.

I just went back to the thread on "parallel.c oblivion of
worker-startup failures" and refreshed my memory about what's going on
over there. What's going on over there is (1) currently,
nworkers_launched can be over-reported in a scenario that doesn't error
out or crash and (2) I'm proposing to tighten things up so that this is
no longer the case. Amit proposed making it the responsibility of code
that uses parallel.c to cope with nworkers_launched being larger than
the number that actually launched, and my counter-proposal was to make
it reliably ERROR when they don't all launch.

So, thinking about this, my proposal to use dynamic barriers here seems
like it will work regardless of who wins that argument. Your proposal
to use static barriers and decrement the party size based on the number
of participants that fail to start will work if I win that argument,
but not if Amit wins it.

It seems to me in general that dynamic barriers are to be preferred in
almost every circumstance, because static barriers require a longer
chain of assumptions. We can't assume that the number of guests we
invite to the party will match the number that actually show up, so, in
the case of a static barrier, we have to make sure to adjust the party
size if some of the guests end up having to stay home with a sick kid,
or their car breaks down, or they decide to go to the neighbor's party
instead. Absentee guests are not intrinsically a problem, but we have
to account for them in a completely water-tight fashion. On the other
hand, with a dynamic barrier, we don't need to worry about the guests
who don't show up; we only need to worry about the guests who DO show
up. As they come in the door, we count them; as they leave, we count
them again. When the numbers are equal, the party's over. That seems
more robust.

In particular, for parallel query, there is absolutely zero guarantee
that every worker reaches every plan node. For a parallel utility
command, things seem a little better: we can assume that workers are
started only for one particular purpose. But even that might not be
true in the future. For example, imagine a parallel CREATE INDEX on a
partitioned table that cascades to all children. One can easily imagine
wanting to use the same workers for the whole operation and spread them
out across the pool of tasks much as Parallel Append does. There's a
good chance this will be faster than doing each index build in turn
with maximum parallelism. And then the static barrier thing goes right
out the window again, because the number of participants is determined
dynamically.

I really struggle to think of any case where a static barrier is
better. I mean, suppose we have an existing party and then decide to
hold a baking contest. We'll use a barrier to separate the baking phase
from the judging phase. One might think that, since the number of
participants is already decided, someone could initialize the barrier
with that number rather than making everyone attach. But it doesn't
really work, because there's a race: while one process is creating the
barrier with participants = 10, the doctor's beeper goes off and he
leaves the party.
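Concretely, the dynamic pattern is just something like this, using the
Barrier API from storage/barrier.h. (This is only a sketch:
shared->barrier and do_assigned_work() are stand-ins, not anything in
the patch.)

    /* Count ourselves in; BarrierAttach() returns the current phase. */
    int     phase = BarrierAttach(&shared->barrier);

    if (phase == 0)
    {
        /* We arrived in time to help. */
        do_assigned_work();
        /* Count ourselves out without waiting for anyone else. */
        BarrierArriveAndDetach(&shared->barrier);
    }
    else
    {
        /* Too late to help; just count ourselves back out. */
        BarrierDetach(&shared->barrier);
    }

Nobody has to predict how many participants will show up; the barrier
just tracks who actually came in the door and who has left.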
Now there could be some situation in which we are absolutely certain
that we know how many participants we've got and that it won't change,
but I suspect that in almost every scenario, deciding to use a static
barrier is going to be immediately followed by a lot of angst about how
we can be sure that the number of participants will always be correct.

>> Okay. I'll work on adopting dynamic barriers in the way you described.
>> I just wanted to make sure that we're all on the same page about what
>> that looks like.
>
> Looking at Robert's sketch, a few thoughts: (1) it's not OK to attach
> and then just exit, you'll need to detach from the barrier both in the
> case where the worker exits early because the phase is too high and
> the case where you attach in time to help and run to completion;

In the first case, I guess this is because otherwise the other
participants will wait for us even though we're not really there any
more. In the second case, I'm not sure why it matters whether we
detach. If we've reached the highest possible phase number, nobody's
going to wait any more, so who cares? (I mean, apart from tidiness.)

> (2) maybe workers could use BarrierArriveAndDetach() at the end (the
> leader needs to use BarrierArriveAndWait(), but the workers don't
> really need to wait for each other before they exit, do they?);

They don't need to wait for each other, but they do need to wait for
the leader, so I don't think this works. Logically, there are two key
sequencing points. First, the leader needs to wait for the workers to
finish sorting. That's the barrier between phase 0 and phase 1.
Second, the workers need to wait for the leader to absorb their tapes.
That's the barrier between phase 1 and phase 2. If the workers use
BarrierArriveAndWait() to reach phase 1 and then
BarrierArriveAndDetach(), they won't wait for the leader to be done
adopting their tapes as they do in the current patch.

But, hang on a minute. Why do the workers need to wait for the leader
anyway? Can't they just exit once they're done sorting? I think the
original reason why this ended up in the patch is that we needed the
leader to assume ownership of the tapes to avoid having the tapes get
blown away when the worker exits. But, IIUC, with sharedfileset.c,
that problem no longer exists. The tapes are jointly owned by all of
the cooperating backends, and the last backend to detach will remove
them. So, if a worker sorts, advertises that it's done in shared
memory, and exits, then nothing should get removed, and the leader can
adopt the tapes whenever it gets around to it.

If that's correct, then we only need 2 phases, not 3. Workers
BarrierAttach() before reading any data, exiting if the phase is not 0.
Otherwise, they read data and sort it, advertise the final tape in
shared memory, and BarrierArriveAndDetach(). The leader does
BarrierAttach() before launching any workers, then reads data and sorts
it if applicable, then does BarrierArriveAndWait(). When that returns,
all workers are done sorting (and may or may not have finished exiting)
and the leader can take over their tapes, and everything is fine.
That's significantly simpler than my previous outline, and also simpler
than what the patch does today.
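In code form, the two-phase scheme would look roughly like this. (Again
just a sketch against the barrier.h API: the shared struct,
read_and_sort_input(), advertise_final_tape(), adopt_worker_tapes(),
and the wait event name are all made up for illustration.)

    /* Worker: count yourself in before reading any data. */
    if (BarrierAttach(&shared->barrier) != 0)
    {
        /* The sort phase is already over; too late to help. */
        BarrierDetach(&shared->barrier);
        return;
    }
    read_and_sort_input();          /* produce this worker's sorted run */
    advertise_final_tape(shared);   /* publish the tape in shared memory */
    BarrierArriveAndDetach(&shared->barrier);  /* done; no need to wait */

    /* Leader: attach before launching anyone, so phase 0 can't end early. */
    BarrierAttach(&shared->barrier);
    LaunchParallelWorkers(pcxt);
    read_and_sort_input();          /* leader participates, if applicable */
    BarrierArriveAndWait(&shared->barrier, WAIT_EVENT_PARALLEL_SORT);
    /* Every attached worker has finished sorting; adopt their tapes. */
    adopt_worker_tapes(shared);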
> (3) erm, maybe it's a problem that errors occurring in workers while
> the leader is waiting at a barrier won't unblock the leader (we don't
> detach from barriers on abort/exit) -- I'll look into this.

I think if there's an ERROR, the general parallelism machinery is going
to arrange to kill every worker, so nothing matters in that case unless
barrier waits ignore interrupts, which I'm pretty sure they don't.
(Also: if they do, I'll hit the ceiling; that would be awful.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company