Re: using extended statistics to improve join estimates

Tomas Vondra Tue, 02 Apr 2024 11:23:15 -0700

On 4/2/24 10:23, Andy Fan wrote:
> 
>> On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:
>>> Rebased over 269b532ae and muted compiler warnings.
> 
> Thank you Justin for the rebase!
> 
> Hello Tomas,
> 
> Thanks for the patch! Before I review the path at the code level, I want
> to explain my understanding about this patch first.
>


If you want to work on this patch, that'd be cool. A review would be
great, but if you want to maybe take over and try moving it forward,
that'd be even better. I don't know when I'll have time to work on it
again, but I'd promise to help you with working on it.

> Before this patch, we already use MCV information for the eqjoinsel, it
> works as combine the MCV on the both sides to figure out the mcv_freq
> and then treat the rest equally, but this doesn't work for MCV in
> extended statistics, this patch fill this gap. Besides that, since
> extended statistics means more than 1 columns are involved, if 1+
> columns are Const based on RestrictInfo, we can use such information to
> filter the MCVs we are interesting, that's really cool. 
> 

Yes, I think that's an accurate description of what the patch does.

> I did some more testing, all of them are inner join so far, all of them
> works amazing and I am suprised this patch didn't draw enough
> attention. I will test more after I go though the code.
> 

I think it didn't go forward for a bunch of reasons:

1) I got distracted by something else requiring immediate attention, and
forgot about this patch.

2) I got stuck on some detail of the patch, unsure which of the possible
solutions to try first.

3) Uncertainty about how applicable the patch is in practice.

I suppose it was some combination of these reasons, not sure.


As for the "practicality" mentioned in (3), it's been a while since I
worked on the patch so I don't recall the details, but I think I've been
thinking mostly about "start join" queries, where a big "fact" table
joins to small dimensions. And in that case the fact table may have a
MCV, but the dimensions certainly don't have any (because the join
happens on a PK).

But maybe that's a wrong way to think about it - it was clearly useful
to consider the case with (per-attribute) MCVs on both sides as worth
special handling. So why not to do that for multi-column MCVs, right?

> At for the code level, I reviewed them in the top-down manner and almost
> 40% completed. Here are some findings just FYI. For efficiency purpose,
> I provide each feedback with a individual commit, after all I want to
> make sure my comment is practical and coding and testing is a good way
> to archive that. I tried to make each of them as small as possible so
> that you can reject or accept them convinently.
> 
> 0001 is your patch, I just rebase them against the current master. 0006
> is not much relevant with current patch, and I think it can be committed
> individually if you are OK with that.
> 
> Hope this kind of review is helpful.
> 

Cool! There's obviously no chance to get this into v18, and I have stuff
to do in this CF. But I'll take a look after that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: using extended statistics to improve join estimates

Reply via email to