On Fri, Nov 22, 2019 at 6:51 PM Jinbao Chen <jinc...@pivotal.io> wrote:
> Hi hackers,
>
> I have made a patch to fix the problem.
>
> It adds the selection rate of the inner table's non-empty buckets.
>
> The planner will use the big table as the inner table in a hash join if the small table has fewer unique values. But this plan is much slower than using the small table as the inner table.
>
> In general, the cost of creating a hash table is higher than the cost of probing a hash table, so we tend to use the small table as the inner table. But if the average chain length of the buckets is large, the situation is just the opposite.
>
> If virtualbuckets is much larger than innerndistinct, and outerndistinct is much larger than innerndistinct, then most tuples of the outer table will match an empty bucket. So when we calculate the cost of traversing the buckets, we need to ignore the tuples that match empty buckets.
>
> So we add the selection rate of the inner table's non-empty buckets. The formula is:
>
> (1 - ((outerndistinct - innerndistinct)/outerndistinct) *
>      ((virtualbuckets - innerndistinct)/virtualbuckets))
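For concreteness, here is a minimal standalone sketch of that selectivity factor, using the example numbers from the Nov 19 mail quoted below (100 inner distinct values, 10000 outer distinct values, 1000 buckets). It only illustrates the formula, it is not the patch itself, and the variable names simply mirror the ones used in this thread.

/*
 * Illustration of the proposed non-empty-bucket selectivity.  This is a
 * standalone toy program, not planner code; the input values come from
 * the example discussed in this thread.
 */
#include <stdio.h>

int
main(void)
{
    double innerndistinct = 100.0;   /* distinct keys in the inner (small) table */
    double outerndistinct = 10000.0; /* distinct keys in the outer (big) table */
    double virtualbuckets = 1000.0;  /* number of hash buckets */

    /*
     * Fraction of outer tuples expected to land on a non-empty bucket:
     * the innerndistinct/outerndistinct share that actually has a match,
     * plus the share of the remaining tuples that hit a used bucket by
     * chance.  With the numbers above this is 1 - 0.99 * 0.9 = 0.109.
     */
    double nonempty_sel =
        1.0 - ((outerndistinct - innerndistinct) / outerndistinct) *
              ((virtualbuckets - innerndistinct) / virtualbuckets);

    printf("non-empty bucket selectivity = %.3f\n", nonempty_sel);
    return 0;
}

The idea of the patch is to multiply the existing chain-scan cost term by this factor, so that probes landing on empty buckets are no longer charged for walking a chain.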
> On Tue, Nov 19, 2019 at 5:56 PM Jinbao Chen <jinc...@pivotal.io> wrote:
>
>> I think we have the same understanding of this issue.
>>
>> Sometimes using a smaller cost for scanning the chain in a bucket, like below, would be better:
>>
>> run_cost += outer_path_rows * some_small_probe_cost;
>> run_cost += hash_qual_cost.per_tuple * approximate_tuple_count();
>>
>> In some versions of Greenplum (a database based on Postgres), we simply disabled the cost of scanning the bucket chain. In most cases this gives a better query plan, but I am worried that it will be worse in some cases.
>>
>> The planner only gets the wrong plan when the small table's number of distinct values is much smaller than the number of buckets and much smaller than the number of distinct values of the large table.
>>
>> For example, suppose the inner table has 100 distinct values and 3000 rows, the hash table has 1000 buckets, and the outer table has 10000 distinct values. We can assume that all 100 distinct values of the inner table are included in the 10000 distinct values of the outer table. So (100/10000)*outer_rows tuples will probe buckets that have a chain, and (9900/10000)*outer_rows tuples will probe the 1000 buckets at random, which means (9900/10000)*outer_rows*(900/1000) tuples will probe empty buckets. So the cost of scanning the bucket chains is:
>>
>> hash_qual_cost.per_tuple * innerbucketsize * outer_rows *
>>     (1 - ((outer_distinct - inner_distinct)/outer_distinct) *
>>          ((buckets_num - inner_distinct)/buckets_num))
>>
>> Do you think this assumption is reasonable?
>>
>> On Tue, Nov 19, 2019 at 3:46 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
>>
>>> On Mon, Nov 18, 2019 at 7:48 PM Jinbao Chen <jinc...@pivotal.io> wrote:
>>> > In the test case above, the small table has 3000 tuples and 100 distinct values on column 'a'. If we use the small table as the inner table, the chain length of each bucket is 30, and we need to search the whole chain when probing the hash table. So the cost of probing is bigger than the cost of building the hash table, and we need to use the big table as the inner table.
>>> >
>>> > But in fact this is not true. We initialized 620,000 buckets in the hash table, but only 100 buckets have chains, each of length 30. The other buckets are empty, and only hash values need to be compared; their costs are very small. We have 100,000 distinct keys and 100,000,000 tuples in the outer table. Only (100/100000)*tuple_num tuples will search the whole chain. The other tuples (number = (99900/100000)*tuple_num) in the outer table just compare with the hash value. So the actual cost is much smaller than what the planner calculated. This is the reason why using the small table as inner is faster.
>>>
>>> So basically we think that if t_big is on the outer side, we'll do 100,000,000 probes and each one is going to scan a t_small bucket with chain length 30, so that looks really expensive. Actually only a small percentage of its probes find tuples with the right hash value, but final_cost_hash_join() doesn't know that. So we hash t_big instead, which we estimated pretty well, and it finishes up with buckets of length 1,000 (which is actually fine in this case; they're not unwanted hash collisions, they're duplicate keys that we need to emit) and we probe them 3,000 times (which is also fine in this case), but we had to do a bunch of memory allocation and/or batch file IO and that turns out to be slower.
>>>
>>> I am not at all sure about this but I wonder if it would be better to use something like:
>>>
>>> run_cost += outer_path_rows * some_small_probe_cost;
>>> run_cost += hash_qual_cost.per_tuple * approximate_tuple_count();
>>>
>>> If we can estimate accurately how many tuples will actually match, that should also be the number of times we have to run the quals, since we don't usually expect hash collisions (bucket collisions, yes, but hash collisions where the key doesn't turn out to be equal, no*).
>>>
>>> * ... but also yes as you approach various limits, so you could also factor in bucket chain length that is due to being prevented from expanding the number of buckets by arbitrary constraints, and perhaps also birthday_problem(hash size, key space) to factor in unwanted hash collisions that start to matter once you get to billions of keys and expect collisions with short hashes.

FYI: I tried this on 12.1 and found that it already uses the small table as the inner table. I haven't looked into the details so far.

postgres=# explain (costs off) select * from join_hash_t_small, join_hash_t_big where a = b;
                        QUERY PLAN
--------------------------------------------------------
 Hash Join
   Hash Cond: (join_hash_t_big.b = join_hash_t_small.a)
   ->  Seq Scan on join_hash_t_big
   ->  Hash
         ->  Seq Scan on join_hash_t_small
(5 rows)

postgres=# select version();
                                                      version
-----------------------------------------------------------------------------------------------------------------
 PostgreSQL 12.1 on x86_64-apple-darwin18.7.0, compiled by Apple LLVM version 10.0.1 (clang-1001.0.46.4), 64-bit
(1 row)
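To make the difference concrete, below is a small standalone comparison of the current style of probe-side charge against the probe-cost-plus-matches idea quoted above, using the t_small/t_big numbers from this thread. The per-tuple cost constants are made-up placeholders (the real hash_qual_cost.per_tuple and the "small probe cost" would come from the planner), so only the relative size of the two results is meaningful; this is a sketch, not the planner's actual arithmetic.

/*
 * Standalone comparison of the two costing ideas discussed in this
 * thread, using the t_small/t_big example (inner: 3000 rows, 100
 * distinct keys, chain length 30; outer: 100,000,000 rows, 100,000
 * distinct keys).  The cost constants are illustrative placeholders.
 */
#include <stdio.h>

int
main(void)
{
    double outer_rows = 100000000.0;          /* t_big rows */
    double chain_length = 30.0;               /* t_small: 3000 rows / 100 keys */
    double match_fraction = 100.0 / 100000.0; /* inner keys / outer keys */

    double qual_cost = 0.0025;                /* stands in for hash_qual_cost.per_tuple */
    double probe_cost = 0.0001;               /* stands in for "some_small_probe_cost" */

    /*
     * Roughly how the current model charges the probe side: every outer
     * tuple is assumed to walk half of an average-length chain (a
     * simplification of the real final_cost_hash_join() arithmetic).
     */
    double current_cost = qual_cost * outer_rows * chain_length * 0.5;

    /*
     * The alternative sketched above: a small fixed cost per probe, plus
     * the qual cost only for probes that actually find matches
     * (match_fraction * outer_rows probes, each hitting a chain of
     * chain_length duplicate keys that must be emitted).
     */
    double matched_tuples = match_fraction * outer_rows * chain_length;
    double proposed_cost = probe_cost * outer_rows +
                           qual_cost * matched_tuples;

    printf("current  probe-side cost term: %.0f\n", current_cost);
    printf("proposed probe-side cost term: %.0f\n", proposed_cost);
    return 0;
}

With these numbers the qual term shrinks by a couple of orders of magnitude, which lines up with the observation in this thread that the t_small-as-inner plan is much cheaper in practice than the current model estimates.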