On Tue, Nov 21, 2017 at 5:38 PM, Peter Geoghegan <p...@bowt.ie> wrote:
>> That having been said, I think the place where our plans most commonly
>> go wrong is where we incorrectly estimate the number of tuples by
>> multiple orders of magnitude - 100x is common, 1000x is common, a
>> million x is not uncommon, even a billion x is not unheard-of. And I
>> don't think there's any way to make a hash join happy if it thinks
>> it's going to need 1 batch and it ends up needing a million batches.
>
> What about dynamic role reversal? That could make a big difference.
In the best case it's great, but it looks to me like there are a lot of thorny problems. For example, imagine

    giant_table INNER JOIN bigger_than_we_thought

The latter table will be chosen as the inner table and that won't work out very well, but there's no way to know whether switching the sides will be any better except to try reading a bunch of rows from giant_table and seeing whether it turns out to be a lot smaller than we thought. To do that, we'll need to dump the hash table we started to build on the original inner side out to disk so that we can free up enough work_mem to try building a hash table on the other side. When the giant table turns out to actually be giant, we'll need to go back to the original plan, which means dumping out the tuples from the second hash table and reloading the tuples from the first one. So we end up just doing a bunch of extra work for nothing. I think that this scenario - wasting effort trying to switch the sides only to give up - will happen frequently.

In the multi-batch case, there seems to be a little more hope of doing something clever. We're already writing most of both inputs out to tapes anyway. If we were willing to write ALL of both inputs out to tapes, then we could decide - perhaps even separately for each batch - which side to load into the hash table. Of course, that adds a lot of incremental I/O unless the number of batches is large (e.g. if we had only 4 batches, writing 4/4 of the data instead of 3/4 is a 33% increase, but if we had 64 batches, writing 64/64 of the data instead of 63/64 doesn't matter a lot, probably). And it leaves out a few important details, like the fact that what fits in the hash table is what determines the number of batches in the first place, and that we write the whole of one side to tapes before starting on the other side. I don't know how to handle those problems, but it seems like it might be possible to come up with something clever, at least for certain cases.

> I agree that it would be enormously valuable if we could make
> estimates much better, so I think that I understand why you emphasize
> it. But, I don't think that there are any good ideas for improving
> join selectivity that don't involve expert DBA knowledge, or
> novel/risky techniques for feedback to the system about column
> redundancy/correlation, etc. These do not seem like scalable
> approaches, and so they don't particularly appeal to me as projects.
> I'd be happy to be shown to be wrong about this.

Yeah, I agree that it's a hard problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
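
P.S. To make the tape-I/O arithmetic above a bit more concrete, here is a throwaway C sketch - purely illustrative, not executor code - that just models the fractions from that paragraph, assuming the current scheme spools about (nbatch - 1)/nbatch of the data and the "spool everything" scheme writes all of it:

    /*
     * Toy model (NOT PostgreSQL executor code) of the extra tape I/O
     * from writing ALL of both hash join inputs to tapes instead of
     * keeping batch 0 in memory.  Assumes the current scheme writes
     * roughly (nbatch - 1)/nbatch of the data; batch counts are made
     * up for illustration.
     */
    #include <stdio.h>

    int
    main(void)
    {
        int     batch_counts[] = {4, 16, 64, 1024};
        int     i;

        for (i = 0; i < 4; i++)
        {
            int     nbatch = batch_counts[i];
            double  written_now = (double) (nbatch - 1) / nbatch;   /* fraction written today */
            double  written_all = 1.0;                              /* spool both sides fully */
            double  increase = (written_all - written_now) / written_now;

            printf("%4d batches: %5.1f%% -> 100%% of data written, a %4.1f%% increase\n",
                   nbatch, written_now * 100.0, increase * 100.0);
        }
        return 0;
    }

That prints roughly a 33% increase at 4 batches, 1.6% at 64, and 0.1% at 1024, which is the point about the extra writes only being cheap when the batch count is large; the payoff would be that, with everything on tapes, each batch could load whichever side turned out smaller.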