On Fri, Jan 19, 2018 at 9:29 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> I think I can see why this patch needs that. Is it mainly for the
> work you are doing in _bt_leader_heapscan where you are waiting for
> all the workers to be finished?
Yes, though it's also needed for the leader tuplesort. It needs to be able to discover worker runs by looking for temp files named 0 through to $NWORKERS - 1. The problem with seeing who shows up after a period of time, and having the leader arbitrarily determine that to be the total number of participants (while blocking further participants from joining), is that I don't know *how long to wait*. This would probably work okay for parallel CREATE INDEX, where the leader participates as a worker, because you can defer the check until the leader has finished acting as a worker. It stands to reason that that's enough time for worker processes to at least show up, and be seen to show up. We can use the duration of the leader's participation as a worker as a natural way to decide how long to wait. But what about when the leader doesn't participate as a worker, for whatever reason? Other uses for parallel tuplesort might typically have much less leader participation than parallel CREATE INDEX. In short, ISTM that seeing who shows up is a bad strategy for parallel tuplesort.

> I think till now we don't have any such requirement, but if it is a must
> for this patch, then I don't think it is tough to do that. We need to
> write an API WaitForParallelWorkerToAttach() and then call it for each
> launched worker, or maybe WaitForParallelWorkersToAttach() which can
> wait for all workers to attach and report how many have successfully
> attached. It will have the functionality of
> WaitForBackgroundWorkerStartup and additionally it needs to check if
> the worker is attached to the error queue. We already have a similar
> API (WaitForReplicationWorkerAttach) for logical replication workers
> as well. Note that it might have a slight impact on performance,
> because with this you need to wait for the workers to start up before
> doing any actual work, but I don't think it should be noticeable for
> large operations, especially operations like parallel create index.

Actually, though it doesn't really look like it from the way things are structured within nbtsort.c, I don't need to wait for workers to start up (call the WaitForParallelWorkerToAttach() function you sketched) before doing any real work within the leader. The leader can participate as a worker, and only do this check afterwards. That will work because the leader Tuplesortstate has yet to do any real work at that point. Nothing stops me from adding a new function to tuplesort, for the leader, that lets the leader say: "New plan -- you should now expect this many participants" (the leader takes this reliable number from an eventual call to WaitForParallelWorkerToAttach()). There's a rough sketch of that ordering at the end of this mail.

I admit that I had no idea that there was this issue with nworkers_launched until very recently. But then, that field has absolutely no comments.

--
Peter Geoghegan
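PS: To make the ordering I have in mind concrete, here's a rough sketch of the leader side. WaitForParallelWorkersToAttach() is used as you sketched it (waiting for the workers and reporting how many actually attached); every other name here (BTLeaderState, leader_participate_as_worker(), tuplesort_set_nparticipants(), leader_merge_worker_runs()) is just a placeholder, not real code from the patch, and error handling is omitted:

/*
 * Rough sketch only (placeholder names, no error handling).  The point
 * is the ordering: the leader does its own worker's worth of work
 * first, and only then needs the reliable participant count from the
 * proposed WaitForParallelWorkersToAttach().
 */
static void
_bt_leader_sort_sketch(ParallelContext *pcxt, BTLeaderState *leader)
{
	int		nparticipants;

	/* 1. Leader acts as a worker against its own worker Tuplesortstate. */
	leader_participate_as_worker(leader);		/* placeholder */

	/*
	 * 2. Workers that failed to launch never show up, so
	 * nworkers_launched alone can't be trusted.  The proposed
	 * WaitForParallelWorkersToAttach() reports how many workers
	 * actually attached to their error queues.
	 */
	nparticipants = WaitForParallelWorkersToAttach(pcxt);

	/*
	 * 3. Tell the leader Tuplesortstate the new plan: expect this many
	 * worker runs, i.e. temp files named 0 .. nparticipants - 1.
	 */
	tuplesort_set_nparticipants(leader->sortstate, nparticipants);	/* placeholder */

	/* 4. Wait for workers to finish, then merge their runs as before. */
	leader_merge_worker_runs(leader, nparticipants);	/* placeholder */
}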