On Thu, Nov 28, 2019 at 7:19 PM Jinbao Chen <jinc...@pivotal.io> wrote:
> Hi Andy,
>
> I just tested the query on 12.1, but pg uses t_big as the inner table:
>
> demo=# explain (costs off) select * from t_small, t_big where a = b;
>              QUERY PLAN
> ------------------------------------
>  Hash Join
>    Hash Cond: (t_small.a = t_big.b)
>    ->  Seq Scan on t_small
>    ->  Hash
>          ->  Seq Scan on t_big
>
> Did you insert the data and set max_parallel_workers_per_gather to 0
> like above?

Sorry for the noise; you are right. I loaded the data but ran the query
immediately, without analyzing. Now it is using the big table as the
inner table.

> create table t_small(a int);
> create table t_big(b int);
> insert into t_small select i%100 from generate_series(0, 3000)i;
> insert into t_big select i%100000 from generate_series(1, 100000000)i;
> analyze t_small;
> analyze t_big;
> set max_parallel_workers_per_gather = 0;
>
> On Thu, Nov 28, 2019 at 5:46 PM Andy Fan <zhihui.fan1...@gmail.com> wrote:
>
>> On Fri, Nov 22, 2019 at 6:51 PM Jinbao Chen <jinc...@pivotal.io> wrote:
>>
>>> Hi hackers,
>>>
>>> I have made a patch to fix the problem: it adds the selectivity of
>>> the inner table's non-empty buckets.
>>>
>>> The planner will use the big table as the inner table in a hash join
>>> if the small table has fewer unique values, but this plan is much
>>> slower than using the small table as the inner table.
>>>
>>> In general, the cost of building a hash table is higher than the
>>> cost of probing it, so we tend to use the small table as the inner
>>> table. But if the average chain length of the buckets is large, the
>>> situation is just the opposite.
>>>
>>> If virtualbuckets is much larger than innerndistinct, and
>>> outerndistinct is much larger than innerndistinct, then most tuples
>>> of the outer table will hash to an empty bucket. So when we
>>> calculate the cost of traversing a bucket, we need to ignore the
>>> tuples that match empty buckets.
>>>
>>> So we add the selectivity of the inner table's non-empty buckets.
>>> The formula is:
>>>
>>>     1 - ((outerndistinct - innerndistinct) / outerndistinct) *
>>>         ((virtualbuckets - innerndistinct) / virtualbuckets)
>>>
>>> On Tue, Nov 19, 2019 at 5:56 PM Jinbao Chen <jinc...@pivotal.io> wrote:
>>>
>>>> I think we have the same understanding of this issue.
>>>>
>>>> Sometimes using smaller costs for scanning the chain in a bucket,
>>>> like below, would be better:
>>>>
>>>>     run_cost += outer_path_rows * some_small_probe_cost;
>>>>     run_cost += hash_qual_cost.per_tuple * approximate_tuple_count();
>>>>
>>>> In some versions of Greenplum (a database based on Postgres), we
>>>> simply disabled the cost of scanning the bucket chain. In most
>>>> cases this gets a better query plan, but I am worried that it will
>>>> be worse in some cases.
>>>>
>>>> The planner only picks the wrong plan when the small table's
>>>> distinct count is much smaller than the bucket number and much
>>>> smaller than the distinct count of the large table.
>>>>
>>>> For example, suppose the inner table has 100 distinct values and
>>>> 3000 rows, the hash table has 1000 buckets, and the outer table has
>>>> 10000 distinct values. We can assume that all 100 distinct values
>>>> of the inner table are included in the 10000 distinct values of the
>>>> outer table. So (100/10000)*outer_rows tuples will probe buckets
>>>> that have a chain, and the remaining (9900/10000)*outer_rows tuples
>>>> will probe the 1000 buckets at random, which means
>>>> (9900/10000)*outer_rows*(900/1000) tuples will probe empty buckets.
>>>> So the cost of scanning the bucket chain is:
>>>>
>>>>     hash_qual_cost.per_tuple * innerbucketsize * outer_rows *
>>>>     (1 - ((outer_distinct - inner_distinct) / outer_distinct) *
>>>>          ((buckets_num - inner_distinct) / buckets_num))
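>>>>
>>>> As a quick check of that arithmetic, here it is as a tiny standalone
>>>> program (plain C, a sketch for illustration only, not planner code):
>>>>
>>>>     #include <stdio.h>
>>>>
>>>>     int main(void)
>>>>     {
>>>>         double outer_distinct = 10000.0;
>>>>         double inner_distinct = 100.0;
>>>>         double buckets_num    = 1000.0;
>>>>
>>>>         /* fraction of outer tuples whose key exists in the inner */
>>>>         double match_frac = inner_distinct / outer_distinct; /* 0.01 */
>>>>
>>>>         /* the rest probe a bucket at random; 900 of 1000 are empty */
>>>>         double empty_frac = (1.0 - match_frac) *
>>>>             (buckets_num - inner_distinct) / buckets_num;    /* 0.891 */
>>>>
>>>>         /* only this fraction of probes walks a chain at all */
>>>>         printf("non-empty probes: %.3f\n", 1.0 - empty_frac); /* 0.109 */
>>>>         return 0;
>>>>     }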
>>>>
>>>> Do you think this assumption is reasonable?
>>>>
>>>> On Tue, Nov 19, 2019 at 3:46 PM Thomas Munro <thomas.mu...@gmail.com>
>>>> wrote:
>>>>
>>>>> On Mon, Nov 18, 2019 at 7:48 PM Jinbao Chen <jinc...@pivotal.io>
>>>>> wrote:
>>>>> > In the test case above, the small table has 3000 tuples and 100
>>>>> > distinct values on column 'a'. If we use the small table as the
>>>>> > inner table, the chain length of each used bucket is 30, and we
>>>>> > need to search the whole chain on probing the hash table. So the
>>>>> > cost of probing is bigger than the cost of building the hash
>>>>> > table, and we need to use the big table as inner.
>>>>> >
>>>>> > But in fact this is not true. We initialized 620,000 buckets in
>>>>> > the hash table, but only 100 buckets have chains, each of length
>>>>> > 30; the other buckets are empty, so only hash values need to be
>>>>> > compared, and those costs are very small. We have 100,000
>>>>> > distinct keys and 100,000,000 tuples in the outer table. Only
>>>>> > (100/100000)*tuple_num tuples will search the whole chain; the
>>>>> > other tuples (number = (99900/100000)*tuple_num) just compare
>>>>> > with the hash value. So the actual cost is much smaller than the
>>>>> > planner calculated. This is the reason why using the small table
>>>>> > as inner is faster.
>>>>>
>>>>> So basically we think that if t_big is on the outer side, we'll do
>>>>> 100,000,000 probes and each one is going to scan a t_small bucket
>>>>> with chain length 30, so that looks really expensive. Actually only
>>>>> a small percentage of its probes find tuples with the right hash
>>>>> value, but final_cost_hashjoin() doesn't know that. So we hash
>>>>> t_big instead, which we estimated pretty well and it finishes up
>>>>> with buckets of length 1,000 (which is actually fine in this case:
>>>>> they're not unwanted hash collisions, they're duplicate keys that
>>>>> we need to emit) and we probe them 3,000 times (which is also fine
>>>>> in this case), but we had to do a bunch of memory allocation and/or
>>>>> batch file IO and that turns out to be slower.
>>>>>
>>>>> I am not at all sure about this but I wonder if it would be better
>>>>> to use something like:
>>>>>
>>>>>     run_cost += outer_path_rows * some_small_probe_cost;
>>>>>     run_cost += hash_qual_cost.per_tuple * approximate_tuple_count();
>>>>>
>>>>> If we can estimate accurately how many tuples will actually match,
>>>>> that should also be the number of times we have to run the quals,
>>>>> since we don't usually expect hash collisions (bucket collisions,
>>>>> yes, but hash collisions where the key doesn't turn out to be
>>>>> equal, no*).
>>>>>
>>>>> * ... but also yes as you approach various limits, so you could
>>>>> also factor in bucket chain length that is due to being prevented
>>>>> from expanding the number of buckets by arbitrary constraints, and
>>>>> perhaps also birthday_problem(hash size, key space) to factor in
>>>>> unwanted hash collisions that start to matter once you get to
>>>>> billions of keys and expect collisions with short hashes.
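>>>>>
>>>>> To be concrete, here is the shape I mean as a rough standalone
>>>>> sketch (not planner code: the probe and qual costs are made-up
>>>>> placeholders, and the match estimate merely stands in for
>>>>> approximate_tuple_count()):
>>>>>
>>>>>     #include <stdio.h>
>>>>>
>>>>>     int main(void)
>>>>>     {
>>>>>         /* numbers from the t_small/t_big test case upthread */
>>>>>         double outer_rows = 100000000.0, outer_ndistinct = 100000.0;
>>>>>         double inner_rows = 3000.0, inner_ndistinct = 100.0;
>>>>>
>>>>>         /* made-up per-probe and per-qual costs */
>>>>>         double some_small_probe_cost = 0.0005;
>>>>>         double hash_qual_per_tuple = 0.0025;
>>>>>
>>>>>         /* stand-in for approximate_tuple_count(): the usual
>>>>>          * equi-join estimate rows1 * rows2 / max(nd1, nd2) */
>>>>>         double nd = outer_ndistinct > inner_ndistinct
>>>>>             ? outer_ndistinct : inner_ndistinct;
>>>>>         double matches = outer_rows * inner_rows / nd; /* 3 million */
>>>>>
>>>>>         double run_cost = 0.0;
>>>>>         run_cost += outer_rows * some_small_probe_cost;
>>>>>         run_cost += hash_qual_per_tuple * matches;
>>>>>
>>>>>         printf("matches = %.0f, run_cost = %.0f\n",
>>>>>                matches, run_cost);
>>>>>         return 0;
>>>>>     }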
>>
>> FYI: I tried this on 12.1, and it already uses the small table as the
>> inner table. I haven't looked into the details so far.
>>
>> postgres=# explain (costs off) select * from join_hash_t_small,
>> join_hash_t_big where a = b;
>>                        QUERY PLAN
>> --------------------------------------------------------
>>  Hash Join
>>    Hash Cond: (join_hash_t_big.b = join_hash_t_small.a)
>>    ->  Seq Scan on join_hash_t_big
>>    ->  Hash
>>          ->  Seq Scan on join_hash_t_small
>> (5 rows)
>>
>> postgres=# select version();
>>                                                       version
>> -----------------------------------------------------------------------------------------------------------------
>>  PostgreSQL 12.1 on x86_64-apple-darwin18.7.0, compiled by Apple LLVM
>> version 10.0.1 (clang-1001.0.46.4), 64-bit
>> (1 row)
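To check my reading of the proposed fix, here is the Nov 22 selectivity
applied to this test case as a tiny standalone program (my own sketch
with my own variable names, not the patch itself):

    #include <stdio.h>

    int main(void)
    {
        /* t_small as inner: 100 distinct values, ~620,000 virtual
         * buckets, 100,000 distinct values on the t_big side */
        double outer_ndistinct = 100000.0;
        double inner_ndistinct = 100.0;
        double virtual_buckets = 620000.0;

        /* selectivity of the inner table's non-empty buckets */
        double nonempty_sel = 1.0 -
            ((outer_ndistinct - inner_ndistinct) / outer_ndistinct) *
            ((virtual_buckets - inner_ndistinct) / virtual_buckets);

        /* prints 0.001161: ~99.9% of probes hit an empty bucket, so
         * the bucket-scan term nearly vanishes and hashing t_small
         * looks cheap again */
        printf("nonempty_sel = %f\n", nonempty_sel);
        return 0;
    }

If that reading is right, it would explain why scaling the chain-scan
cost by this factor makes the planner prefer t_small as the inner table.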