On Thu, Nov 28, 2019 at 7:19 PM Jinbao Chen <jinc...@pivotal.io> wrote:
> Hi Andy,
>
> I just tested the query on 12.1, but pg uses t_big as the inner table:
>
> demo=# explain (costs off) select * from t_small, t_big where a = b;
>              QUERY PLAN
> ------------------------------------
>  Hash Join
>    Hash Cond: (t_small.a = t_big.b)
>    ->  Seq Scan on t_small
>    ->  Hash
>          ->  Seq Scan on t_big
>
> Did you insert the data and set max_parallel_workers_per_gather to 0
> like above?

Sorry for the noise; you are right. I loaded the data but ran the query
immediately, without analyzing. Now it is using the big table as the
inner table.

> create table t_small(a int);
> create table t_big(b int);
> insert into t_small select i%100 from generate_series(0, 3000)i;
> insert into t_big select i%100000 from generate_series(1, 100000000)i;
> analyze t_small;
> analyze t_big;
> set max_parallel_workers_per_gather = 0;
>
> On Thu, Nov 28, 2019 at 5:46 PM Andy Fan <zhihui.fan1...@gmail.com> wrote:
>
>> On Fri, Nov 22, 2019 at 6:51 PM Jinbao Chen <jinc...@pivotal.io> wrote:
>>
>>> Hi hackers,
>>>
>>> I have made a patch to fix the problem: it adds the selectivity of
>>> the inner table's non-empty buckets.
>>>
>>> The planner will use the big table as the inner table in a hash join
>>> if the small table has fewer unique values, but this plan is much
>>> slower than using the small table as the inner table.
>>>
>>> In general, the cost of building a hash table is higher than the
>>> cost of probing it, so we tend to use the small table as the inner
>>> table. But if the average chain length of the buckets is large, the
>>> situation is just the opposite.
>>>
>>> If virtualbuckets is much larger than innerndistinct, and
>>> outerndistinct is much larger than innerndistinct, then most tuples
>>> of the outer table will hash to an empty bucket. So when we
>>> calculate the cost of traversing a bucket, we need to ignore the
>>> tuples that match empty buckets.
>>>
>>> So we add the selectivity of the inner table's non-empty buckets.
>>> The formula is:
>>>
>>>     1 - ((outerndistinct - innerndistinct) / outerndistinct) *
>>>         ((virtualbuckets - innerndistinct) / virtualbuckets)
>>>
>>> On Tue, Nov 19, 2019 at 5:56 PM Jinbao Chen <jinc...@pivotal.io> wrote:
>>>
>>>> I think we have the same understanding of this issue.
>>>>
>>>> Sometimes using smaller costs for scanning the chain in a bucket,
>>>> like below, would be better:
>>>>
>>>>     run_cost += outer_path_rows * some_small_probe_cost;
>>>>     run_cost += hash_qual_cost.per_tuple * approximate_tuple_count();
>>>>
>>>> In some versions of Greenplum (a database based on Postgres), we
>>>> simply disabled the cost of scanning the bucket chain. In most
>>>> cases this gets a better query plan, but I am worried that it will
>>>> be worse in some cases.
>>>>
>>>> The planner only picks the wrong plan when the small table's
>>>> distinct count is much smaller than the bucket number and much
>>>> smaller than the distinct count of the large table.
>>>>
>>>> For example, suppose the inner table has 100 distinct values and
>>>> 3000 rows, the hash table has 1000 buckets, and the outer table has
>>>> 10000 distinct values. We can assume that all 100 distinct values
>>>> of the inner table are included in the 10000 distinct values of the
>>>> outer table. So (100/10000)*outer_rows tuples will probe buckets
>>>> that have a chain, and the remaining (9900/10000)*outer_rows tuples
>>>> will probe the 1000 buckets at random, which means
>>>> (9900/10000)*outer_rows*(900/1000) tuples will probe empty buckets.
>>>> So the cost of scanning the bucket chain is:
>>>>
>>>>     hash_qual_cost.per_tuple * innerbucketsize * outer_rows *
>>>>     (1 - ((outer_distinct - inner_distinct) / outer_distinct) *
>>>>          ((buckets_num - inner_distinct) / buckets_num))
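>>>>
>>>> As a quick check of that arithmetic, here it is as a tiny standalone
>>>> program (plain C, a sketch for illustration only, not planner code):
>>>>
>>>>     #include <stdio.h>
>>>>
>>>>     int main(void)
>>>>     {
>>>>         double outer_distinct = 10000.0;
>>>>         double inner_distinct = 100.0;
>>>>         double buckets_num    = 1000.0;
>>>>
>>>>         /* fraction of outer tuples whose key exists in the inner */
>>>>         double match_frac = inner_distinct / outer_distinct; /* 0.01 */
>>>>
>>>>         /* the rest probe a bucket at random; 900 of 1000 are empty */
>>>>         double empty_frac = (1.0 - match_frac) *
>>>>             (buckets_num - inner_distinct) / buckets_num;    /* 0.891 */
>>>>
>>>>         /* only this fraction of probes walks a chain at all */
>>>>         printf("non-empty probes: %.3f\n", 1.0 - empty_frac); /* 0.109 */
>>>>         return 0;
>>>>     }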
>>>>
>>>> Do you think this assumption is reasonable?
>>>>
>>>> On Tue, Nov 19, 2019 at 3:46 PM Thomas Munro <thomas.mu...@gmail.com>
>>>> wrote:
>>>>
>>>>> On Mon, Nov 18, 2019 at 7:48 PM Jinbao Chen <jinc...@pivotal.io>
>>>>> wrote:
>>>>> > In the test case above, the small table has 3000 tuples and 100
>>>>> > distinct values on column 'a'. If we use the small table as the
>>>>> > inner table, the chain length of each used bucket is 30, and we
>>>>> > need to search the whole chain on probing the hash table. So the
>>>>> > cost of probing is bigger than the cost of building the hash
>>>>> > table, and we need to use the big table as inner.
>>>>> >
>>>>> > But in fact this is not true. We initialized 620,000 buckets in
>>>>> > the hash table, but only 100 buckets have chains, each of length
>>>>> > 30; the other buckets are empty, so only hash values need to be
>>>>> > compared, and those costs are very small. We have 100,000
>>>>> > distinct keys and 100,000,000 tuples in the outer table. Only
>>>>> > (100/100000)*tuple_num tuples will search the whole chain; the
>>>>> > other tuples (number = (99900/100000)*tuple_num) just compare
>>>>> > with the hash value. So the actual cost is much smaller than the
>>>>> > planner calculated. This is the reason why using the small table
>>>>> > as inner is faster.
>>>>>
>>>>> So basically we think that if t_big is on the outer side, we'll do
>>>>> 100,000,000 probes and each one is going to scan a t_small bucket
>>>>> with chain length 30, so that looks really expensive. Actually only
>>>>> a small percentage of its probes find tuples with the right hash
>>>>> value, but final_cost_hashjoin() doesn't know that. So we hash
>>>>> t_big instead, which we estimated pretty well and it finishes up
>>>>> with buckets of length 1,000 (which is actually fine in this case:
>>>>> they're not unwanted hash collisions, they're duplicate keys that
>>>>> we need to emit) and we probe them 3,000 times (which is also fine
>>>>> in this case), but we had to do a bunch of memory allocation and/or
>>>>> batch file IO and that turns out to be slower.
>>>>>
>>>>> I am not at all sure about this but I wonder if it would be better
>>>>> to use something like:
>>>>>
>>>>>     run_cost += outer_path_rows * some_small_probe_cost;
>>>>>     run_cost += hash_qual_cost.per_tuple * approximate_tuple_count();
>>>>>
>>>>> If we can estimate accurately how many tuples will actually match,
>>>>> that should also be the number of times we have to run the quals,
>>>>> since we don't usually expect hash collisions (bucket collisions,
>>>>> yes, but hash collisions where the key doesn't turn out to be
>>>>> equal, no*).
>>>>>
>>>>> * ... but also yes as you approach various limits, so you could
>>>>> also factor in bucket chain length that is due to being prevented
>>>>> from expanding the number of buckets by arbitrary constraints, and
>>>>> perhaps also birthday_problem(hash size, key space) to factor in
>>>>> unwanted hash collisions that start to matter once you get to
>>>>> billions of keys and expect collisions with short hashes.
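>>>>>
>>>>> To be concrete, here is the shape I mean as a rough standalone
>>>>> sketch (not planner code: the probe and qual costs are made-up
>>>>> placeholders, and the match estimate merely stands in for
>>>>> approximate_tuple_count()):
>>>>>
>>>>>     #include <stdio.h>
>>>>>
>>>>>     int main(void)
>>>>>     {
>>>>>         /* numbers from the t_small/t_big test case upthread */
>>>>>         double outer_rows = 100000000.0, outer_ndistinct = 100000.0;
>>>>>         double inner_rows = 3000.0, inner_ndistinct = 100.0;
>>>>>
>>>>>         /* made-up per-probe and per-qual costs */
>>>>>         double some_small_probe_cost = 0.0005;
>>>>>         double hash_qual_per_tuple = 0.0025;
>>>>>
>>>>>         /* stand-in for approximate_tuple_count(): the usual
>>>>>          * equi-join estimate rows1 * rows2 / max(nd1, nd2) */
>>>>>         double nd = outer_ndistinct > inner_ndistinct
>>>>>             ? outer_ndistinct : inner_ndistinct;
>>>>>         double matches = outer_rows * inner_rows / nd; /* 3 million */
>>>>>
>>>>>         double run_cost = 0.0;
>>>>>         run_cost += outer_rows * some_small_probe_cost;
>>>>>         run_cost += hash_qual_per_tuple * matches;
>>>>>
>>>>>         printf("matches = %.0f, run_cost = %.0f\n",
>>>>>                matches, run_cost);
>>>>>         return 0;
>>>>>     }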
>>
>> FYI: I tried this on 12.1, and it already uses the small table as the
>> inner table. I haven't looked into the details so far.
>>
>> postgres=# explain (costs off) select * from join_hash_t_small,
>> join_hash_t_big where a = b;
>>                        QUERY PLAN
>> --------------------------------------------------------
>>  Hash Join
>>    Hash Cond: (join_hash_t_big.b = join_hash_t_small.a)
>>    ->  Seq Scan on join_hash_t_big
>>    ->  Hash
>>          ->  Seq Scan on join_hash_t_small
>> (5 rows)
>>
>> postgres=# select version();
>>                                                       version
>> -----------------------------------------------------------------------------------------------------------------
>>  PostgreSQL 12.1 on x86_64-apple-darwin18.7.0, compiled by Apple LLVM
>> version 10.0.1 (clang-1001.0.46.4), 64-bit
>> (1 row)
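To check my reading of the proposed fix, here is the Nov 22 selectivity
applied to this test case as a tiny standalone program (my own sketch
with my own variable names, not the patch itself):

    #include <stdio.h>

    int main(void)
    {
        /* t_small as inner: 100 distinct values, ~620,000 virtual
         * buckets, 100,000 distinct values on the t_big side */
        double outer_ndistinct = 100000.0;
        double inner_ndistinct = 100.0;
        double virtual_buckets = 620000.0;

        /* selectivity of the inner table's non-empty buckets */
        double nonempty_sel = 1.0 -
            ((outer_ndistinct - inner_ndistinct) / outer_ndistinct) *
            ((virtual_buckets - inner_ndistinct) / virtual_buckets);

        /* prints 0.001161: ~99.9% of probes hit an empty bucket, so
         * the bucket-scan term nearly vanishes and hashing t_small
         * looks cheap again */
        printf("nonempty_sel = %f\n", nonempty_sel);
        return 0;
    }

If that reading is right, it would explain why scaling the chain-scan
cost by this factor makes the planner prefer t_small as the inner table.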