On Tue, Jan 7, 2020 at 4:14 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> * I am uneasy about BarrierArriveExplicitAndWait() (a variant of
> BarrierArriveAndWait() that lets you skip directly to a given phase?);
> perhaps you only needed that for a circular phase system, which you
> could do with modular phase numbers, like PHJ_GROW_BATCHES_PHASE? I
> tried to make the barrier interfaces look like the libraries in other
> parallel programming environments, and I'd be worried that the
> explicit phase thing could easily lead to bugs.

So, I actually use it to circle back up to the first phase while
skipping the last phase, so I couldn't do it with modular phase numbers
and a loop. The last phase detaches from the chunk barrier, and I don't
want to detach from the chunk barrier if there are more chunks. I
basically need a way to attach to the chunk barrier only at the
beginning of the first chunk and detach only at the end of the last
chunk--not in between chunks. I will return from the function and
re-enter between chunks--say between chunk 2 and chunk 3 of 5.

However, could this be solved by having more than one chunk barrier? A
worker would attach to one chunk barrier and then, when it moves to the
next chunk, attach to the other chunk barrier, switching back again for
the chunk after that. Then it could detach and attach each time it
enters/leaves the function.
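To make sure we're picturing the same shape of control flow, here is a
simplified sketch of the single-chunk-barrier version (not the actual
patch code; chunkno, nchunks, and PHJ_CHUNK_PROBING are just
placeholders here):

    /* Rough sketch only -- placeholder names, not the actual patch code */
    if (chunkno == 0)
        BarrierAttach(chunk_barrier);    /* only at the first chunk */

    /* ... probe the current chunk, returning to emit matched tuples ... */

    if (chunkno < nchunks - 1)
    {
        /*
         * More chunks remain: wait for the other attached workers and
         * jump back to the first per-chunk phase, skipping the final
         * (detach) phase.  This is where BarrierArriveExplicitAndWait()
         * comes in; a plain BarrierArriveAndWait() would keep advancing
         * the phase toward the detach phase.
         */
        BarrierArriveExplicitAndWait(chunk_barrier, PHJ_CHUNK_PROBING);
    }
    else
        BarrierArriveAndDetach(chunk_barrier);   /* after the last chunk */

That is the flow I couldn't see how to express with modular phase
numbers and a loop, since the detaching phase would sit between the end
of one chunk and the start of the next.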
> * I'm not sure it's OK to wait at the end of each loop, as described
> in the commit message:
>
> Workers probing a fallback batch will wait until all workers have
> finished probing before moving on so that an elected worker can read
> and combine the outer match status files into a single bitmap and use
> it to emit unmatched outer tuples after all chunks of the inner side
> have been processed.
>
> Maybe I misunderstood completely, but that seems to break the
> programming rule described in nodeHashjoin.c's comment beginning "To
> avoid deadlocks, ...". To recap: (1) When you emit a tuple, the
> program counter escapes to some other node, and maybe that other node
> waits for thee, (2) Maybe the leader is waiting for you but you're
> waiting for it to drain its queue so you can emit a tuple (I learned a
> proper name for this: "flow control deadlock"). That's why the
> current code only ever detaches (a non-waiting operation) after it's
> begun emitting tuples (that is, the probing phase). It just moves
> onto another batch. That's not a solution here: you can't simply move
> to another loop, loops are not independent of each other like batches.
> It's possible that barriers are not the right tool for this part of
> the problem, or that there is a way to use a barrier that you don't
> remain attached to while emitting, or that we should remove the
> deadlock risks another way entirely[1] but I'm not sure. Furthermore,
> the new code in ExecParallelHashJoinNewBatch() appears to break the
> rule even in the non-looping case (it calls BarrierArriveAndWait() in
> ExecParallelHashJoinNewBatch(), where the existing code just
> detaches).

Yea, I think I'm totally breaking that rule. Just to make sure I
understand the way in which I am breaking it: in my patch, while
attached to a chunk_barrier, worker1 emits a matched tuple (control
leaves the current node). Meanwhile, worker2 has finished probing the
chunk and is waiting on the chunk_barrier for worker1. How, though,
could worker1 be waiting for worker2? Is this only a problem when one
of the barrier participants is the leader and is reading from the
tuple queue? (I'm reading your tuple queue deadlock hazard example in
the thread [1] you referred to.) Basically, is my deadlock hazard a
tuple queue deadlock hazard?

I thought maybe this could be a problem with nested HJ nodes, but I'm
not sure.

As I understand it, this isn't a problem on current master with batch
barriers because a worker can emit tuples while attached to a
batch_barrier: no other workers will wait on the batch barrier once
they have started probing.

I need to think more about the suggestions you provided in [1] about
nixing the tuple queue deadlock hazard. However, hypothetically, if we
decide we don't want to break the "no emitting tuples while attached
to a barrier" rule, how can we still allow workers to coordinate while
probing chunks of the batch sequentially (one chunk at a time)? I can
think of two options (both sound slow and bad):

Option 1: Stash the matched tuples in a tuplestore and emit them at
the end of the batch (incurring more writes).

Option 2: Degenerate to one worker for fallback batches.

Any other ideas?

> > - Rename "chunk" (as in chunks of inner side) to something that is
> > not already used in the context of memory chunks and, more
> > importantly, SharedTuplestoreChunk
>
> +1. Fragments? Loops? Blocks (from
> https://en.wikipedia.org/wiki/Block_nested_loop, though, no, strike
> that, blocks are also super overloaded).

Hmmm. I think "loop" is kinda confusing. "Fragment" has potential. I
also thought of "piece"; that is actually where I am leaning now. What
do you think?

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BA6ftXPz4oe92%2Bx8Er%2BxpGZqto70-Q_ERwRaSyA%3DafNg%40mail.gmail.com

--
Melanie Plageman