On Fri, Jun 7, 2019 at 10:17 AM Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> Yes, they could get quite big, and I think you're right we need to
> keep that in mind, because it's on the outer (often quite large) side of
> the join. And if we're aiming to restrict memory usage, it'd be weird to
> just ignore this.
>
> But I think Thomas Munro originally proposed to treat this as a separate
> BufFile, so my assumption was each worker would simply rewrite the bitmap
> repeatedly for each hash table fragment. That means a bit more I/O, but
> those files are buffered and written in 8kB pages, with just 1 bit per
> tuple. I think that's pretty OK and way cheaper than rewriting the whole
> batch, where each tuple can be hundreds of bytes.
Yes, this is also my thought. I'm not 100% sure I understand Melanie's proposal, but I think that it involves writing every still-unmatched outer tuple for every inner batch. This proposal -- assuming we can get the tuple numbering worked out -- involves writing a bit for every outer tuple for every inner batch. So each time you do an inner batch, you write either (a) one bit for EVERY outer tuple or (b) the entirety of each unmatched tuple. It's possible for the latter to be cheaper if the number of unmatched tuples is really, really tiny, but it's not very likely.

For example, suppose that you've got 4 batches and each batch matches 99% of the tuples, which are each 50 bytes wide. After each batch, approach A writes 1 bit per tuple, so a total of 4 bits per tuple after 4 batches. Approach B writes a different amount of data after each batch. After the first batch, it writes 1% of the tuples, and for each one written it writes 50 bytes, so it writes 50 bytes * 0.01 = 0.5 bytes = ~4 bits/tuple. That's already equal to what approach A wrote after all 4 batches, and it's going to do a little more I/O over the course of the remaining batches -- although not much, because the unmatched tuples file will be very, very tiny after we eliminate 99% of the 1% that survived the first batch. However, these are extremely favorable assumptions for approach B. If the tuples are wider, or the batches match only, say, 20% of the tuples, approach B is going to be waaaay more I/O. (There's a little back-of-the-envelope program at the end of this message that runs these numbers, for anyone who wants to fiddle with the assumptions.)

Assuming I understand correctly, which I may not.

> Also, it does not require any concurrency control, which rewriting the
> batches themselves probably does (because we'd be feeding the tuples into
> some shared file, I suppose). Except for the final step when we need to
> merge the bitmaps, of course.

I suppose that rewriting the batches -- or really the unmatched tuples file -- could just use a SharedTuplestore, so we probably wouldn't need a lot of new code for this. I don't know whether contention would be a problem or not.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
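
P.S. Here's the tiny back-of-the-envelope program mentioned above, for anyone who wants to check my math or fiddle with the assumptions. It's just the arithmetic from my example, not proposed PostgreSQL code: the batch count, tuple width, and per-batch match rate are the made-up numbers I used, and "approach A" / "approach B" mean what they mean above.

#include <stdio.h>

int
main(void)
{
	int		nbatches = 4;			/* number of inner batches */
	double	tuple_width = 50.0;		/* outer tuple width, in bytes */
	double	match_rate = 0.99;		/* fraction of still-unmatched outer
									 * tuples matched by each inner batch */

	/*
	 * Approach A: after every inner batch, write one bit per outer tuple
	 * into the bitmap file, no matter how many tuples matched.
	 */
	double	a_bytes = nbatches * (1.0 / 8.0);

	/*
	 * Approach B: after every inner batch, rewrite the still-unmatched
	 * outer tuples in full; the unmatched fraction shrinks geometrically.
	 */
	double	b_bytes = 0.0;
	double	unmatched = 1.0;
	int		i;

	for (i = 0; i < nbatches; i++)
	{
		unmatched *= 1.0 - match_rate;
		b_bytes += unmatched * tuple_width;
	}

	printf("approach A: %.3f bytes (%.2f bits) per outer tuple\n",
		   a_bytes, a_bytes * 8.0);
	printf("approach B: %.3f bytes (%.2f bits) per outer tuple\n",
		   b_bytes, b_bytes * 8.0);
	return 0;
}

With the 99% match rate it prints roughly 4 bits per outer tuple for both approaches; drop the match rate to 20% and approach B balloons to over a hundred bytes per tuple while approach A stays at half a byte.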