On Wed, Jul 31, 2019 at 6:47 AM Melanie Plageman
<melanieplage...@gmail.com> wrote:
> So, I've rewritten the patch to use a BufFile for the outer table
> batch file tuples' match statuses and write bytes to and from the file
> which start as 0 and, upon encountering a match for a tuple, I set its
> bit in the file to 1 (also rebased with current master).
>
> It, of course, only works for parallel-oblivious hashjoin -- it relies
> on deterministic order of tuples encountered in the outer side batch
> file to set the right match bit and uses a counter to decide which bit
> to set.
>
> I did the "needlessly dumb implementation" Robert mentioned, though.
> I thought about it and couldn't come up with a much smarter way to
> write match bits to a file. I think there might be an optimization
> opportunity in not writing the current_byte to the file each time that
> the outer tuple matches and only doing this once we have advanced to a
> tuple number that wouldn't have its match bit in the current_byte. I
> didn't do that to keep it simple, and, I suspect there might be a bit
> of gymnastics needed to make sure that that byte is actually written
> to the file in case we exit from some other state before we encounter
> the tuple represented in the last bit in that byte.
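To make sure I'm following the scheme described above, here is a
minimal self-contained sketch of it (plain C with an array standing in
for the BufFile; the counter/current_byte names are borrowed from the
description, everything else is invented for illustration and is not
the patch's actual code):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Toy stand-in for the outer batch's match-status BufFile. */
#define MATCHFILE_BYTES 8

static uint8_t matchfile[MATCHFILE_BYTES];
static size_t  tuple_counter = 0;    /* which outer tuple we're on */
static uint8_t current_byte = 0;     /* match bits accumulated in memory */
static size_t  current_byte_no = 0;  /* which file byte they belong to */

/* Write the buffered byte back to the "file". */
static void
flush_current_byte(void)
{
    matchfile[current_byte_no] = current_byte;
}

/*
 * Called once per outer tuple, in the deterministic order of the batch
 * file.  Instead of writing to the file on every match, we only flush
 * once the counter has advanced past the byte we're buffering -- the
 * optimization described above.
 */
static void
note_outer_tuple(int matched)
{
    size_t byte_no = tuple_counter / 8;

    if (byte_no != current_byte_no)
    {
        flush_current_byte();
        current_byte_no = byte_no;
        current_byte = 0;
    }
    if (matched)
        current_byte |= (uint8_t) (1 << (tuple_counter % 8));
    tuple_counter++;
}

/*
 * The "gymnastics" mentioned above: the last partially filled byte must
 * still reach the file if we leave the probing state early.
 */
static void
finish_probing(void)
{
    flush_current_byte();
}
```

The interesting part is indeed making sure finish_probing() runs on
every exit path from the probing state, not just the happy one.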
Thanks for working on this! I plan to poke at it a bit in the next few
weeks.

> I plan to work on a separate implementation for parallel hashjoin
> next--to understand what is required. I believe the logic to decide
> when to fall back should be fairly easy to slot in at the end once
> we've decided what that logic is.

Seems like a good time for me to try to summarise what I think the
main problems are here:

1. The match-bit storage problem already discussed. The tuples that
each process receives while reading from a SharedTupleStore are
non-deterministic (like other parallel scans). To use a bitmap-based
approach, I guess we'd need to invent some way to give the tuples a
stable identifier within some kind of densely packed number space that
we could use to address the bitmap, or take the IO hit and write all
the tuples back. That might involve changing the way SharedTupleStore
holds data.

2. Tricky problems relating to barriers and flow control. First, let
me explain why PHJ doesn't support full/right outer joins yet. At
first I thought it was going to be easy, because, although the shared
memory hash table is read-only after it has been built, it seems safe
to weaken that only slightly and let the match flag be set by any
process during probing: it's OK if two processes clobber each other's
writes, as the only transition is a single bit going strictly from 0
to 1, and there will certainly be a full memory barrier before anyone
tries to read those match bits. Then during the scan for unmatched,
you just have to somehow dole out hash table buckets or ranges of
buckets to processes on a first-come-first-served basis. But... then I
crashed into the following problem:

* You can't begin the scan for unmatched tuples until every process
has finished probing (ie until you have the final set of match bits).
* You can't wait for every process to finish probing, because any
process that has emitted a tuple might never come back if there is
another node that is also waiting for all processes (ie deadlock
against another PHJ doing the same thing), and probing is a phase that
emits tuples.

Generally, it's not safe to emit tuples while you are attached to a
Barrier, unless you're only going to detach from it, not wait at it,
because emitting tuples lets the program counter escape your control.
Equally, it's not safe to detach from a Barrier while accessing
resources whose lifetime it controls, such as a hash table, because
then it might go away underneath you.

The PHJ plans that are supported currently adhere to that programming
rule and so don't have a problem: after the Barrier reaches the
probing phase, processes never wait for each other again, so they're
free to begin emitting tuples. They just detach when they're done
probing, and the last to detach cleans up (frees the hash table etc).
If there is more than one batch, they detach from one batch and attach
to another when they're ready (each batch has its own Barrier), so we
can consider the batches to be entirely independent.

There is probably a way to make a scan-for-unmatched-inner phase work,
possibly involving another Barrier or something like that, but I ran
out of time trying to figure it out and wanted to ship a working PHJ
for the more common plan types.

I suppose PHLJ will face two variants of this problem: (1) you need to
synchronise the loops (you can't dump the hash table in preparation
for the next loop until all have finished probing for the current
loop), and yet you've already emitted tuples, so you're not allowed to
wait for other processes and they're not allowed to wait for you, and
(2) you can't start the scan-for-unmatched-outer until all the probe
loops belonging to one batch are done.
The first problem is sort of analogous to a problem I faced with
batches in the first place, which Robert and I found a solution to by
processing the batches in parallel, and it could perhaps be solved in
the same way: run the loops in parallel (if that sounds crazy, recall
that every worker has its own quota of work_mem and the data is
entirely prepartitioned up front, which is why we are able to run the
batches in parallel; in contrast, single-batch mode makes a hash table
with a quota of nparticipants * work_mem). The second problem is sort
of analogous to the existing scan-for-unmatched-inner problem that I
haven't solved.

I think there may be ways to make that general class of deadlock
problem go away in a future asynchronous executor model, where N
streams conceptually run concurrently in event-driven nodes so that
control never gets stuck in a node, but that seems quite far off and I
haven't worked out the details. The same problem comes up in a
hypothetical Parallel Repartition node: you're not done with your
partition until all processes have run out of input tuples, so you
have to wait for all of them to send an EOF, and so you risk deadlock
if they are waiting for you elsewhere in the tree. A stupid version of
the idea is to break the node up into a consumer part and a producer
part, and put the producer into a subprocess so that its program
counter can never escape and deadlock somewhere in the consumer part
of the plan. Obviously we don't want to have loads of extra OS
processes all over the place, but I think you can get the same effect
using a form of asynchronous execution where the program counter jumps
between nodes and streams based on readiness, and yields control
instead of blocking. Similar ideas have been proposed to deal with
asynchronous IO.

--
Thomas Munro
https://enterprisedb.com