On Mon, Sep 11, 2023, at 11:51 AM, Andy Fan wrote:
> Hi, 
>
> On Thu, Jun 15, 2023 at 4:30 PM Andrey Lepikhov 
> <a.lepik...@postgrespro.ru> wrote:
>> Hi, all.
>> 
>> Some of my clients use JOIN's with three - four clauses. Quite 
>> frequently, I see complaints on unreasonable switch of JOIN algorithm to 
>> Merge Join instead of Hash Join. Quick research have shown one weak 
>> place - estimation of an average bucket size in final_cost_hashjoin (see 
>> q2.sql in attachment) with very conservative strategy.
>> Unlike estimation of groups, here we use smallest ndistinct value across 
>> all buckets instead of multiplying them (or trying to make multivariate 
>> analysis).
>> It works fine for the case of one clause. But if we have many clauses, 
>> and if each has high value of ndistinct, we will overestimate average 
>> size of a bucket and, as a result, prefer to use Merge Join. As the 
>> example in attachment shows, it leads to worse plan than possible, 
>> sometimes drastically worse.
>> I assume, this is done with fear of functional dependencies between hash 
>> clause components. But as for me, here we should go the same way, as 
>> estimation of groups.
>
> I can reproduce the visitation you want to improve and verify the patch
> can do it expectedly.  I think this is a right thing to do.  
> 
>> The attached patch shows a sketch of the solution.
>
> I understand that this is a sketch of the solution,  but the  below 
> changes still
> make me confused. 
>
> + if (innerbucketsize > virtualbuckets)
> +     innerbucketsize = 1.0 / virtualbuckets;
>
> innerbucketsize is a fraction of rows in all the rows, so it is between 
> 0.0 and 1.0.
> and virtualbuckets is the number of buckets in total (when considered 
> the mutli
> batchs),  how is it possible for 'innerbucketsize > virtualbuckets' ?  
> Am
> I missing something? 

You are right here. I've made a mistake here. Changed diff is in attachment.

-- 
Regards,
Andrei Lepikhov

Attachment: fix_bucketsize_v2.diff
Description: Binary data

Reply via email to