On 09/07/2014 07:09 PM, Emre Hasegeli wrote:
>>> I updated the patch to cover semi and anti joins with eqjoinsel_semi().
>>> I think it is better than returning a constant.
>>
>> What you did there is utterly unacceptable from a modularity standpoint;
>> and considering that the values will be nowhere near right, the argument
>> that "it's better than returning a constant" seems pretty weak.  I think
>> you should just take that out again.
>
> I will try to come up with a better, data type specific implementation
> in a week.
>
> New version with semi join estimation function attached.

Thanks. Overall, my impression of this patch is that it works very well. But damned if I understood *how* it works :-). There's a lot of statistics involved, and it's not easy to see why something is multiplied by something else. I'm adding comments as I read through it.

I've gotten to the inet_semi_join_selec function:

/*
 * Inet semi join selectivity estimation.
 */
static Selectivity
inet_semi_join_selec(bool mcv2_exists, Datum *mcv2_values, int mcv2_nvalues,
                     bool his2_exists, Datum *his2_values, int his2_nvalues,
                     double his2_weight, Datum *constvalue,
                     FmgrInfo *proc, short opr_order)
{
        if (mcv2_exists)
        {
                int                     i;

                for (i = 0; i < mcv2_nvalues; i++)
                {
                        if (DatumGetBool(FunctionCall2Coll(proc, DEFAULT_COLLATION_OID,
                                                           *constvalue, mcv2_values[i])))
                                return 1.0;
                }
        }

        /* Do not bother if histogram weight is smaller than 0.1. */
        if (his2_exists && his2_weight > 0.1)
        {
                Selectivity     his_selec;

                his_selec = inet_his_inclusion_selec(his2_values, his2_nvalues,
                                                     constvalue, opr_order);

                if (his_selec > 0)
                        return Min(1.0, his2_weight * his_selec);
        }

        return 0.0;
}

This desperately needs a comment at the top of the function explaining what it does. Let me try to explain what I think it does:

This function calculates the probability that there is at least one row in table B that satisfies the "constant op column" qual. The constant is passed as an argument, and for table B, the MCV list and histogram are provided. his2_weight is the total number of rows in B that are covered by the histogram. For example, if the table has 1000 rows, 10% of the rows in the table are in the MCV, and another 10% are NULLs, his2_weight would be 800.

First, we check if the constant matches any of the most common values. If it does, return 1.0, because then there is surely a match.

Next, we use the histogram to estimate the number of rows in the table that match the qual. If it amounts to more than 1 row, we return 1.0. If it's between 0.0 and 1.0 rows, we return that number as the probability.
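Assuming that reading is right (please correct me where it's not), a comment along these lines at the top of the function would already help:

/*
 * Semi join selectivity estimation for the inet operators.
 *
 * Estimates the probability that at least one row of the inner relation
 * satisfies the "constant OP inner column" qual.  The constant comes from
 * the outer side; for the inner side we are given the MCV list and the
 * histogram.  his2_weight is the number of inner rows represented by the
 * histogram, i.e. total rows minus NULLs and MCV members.
 *
 * If the constant matches one of the most common values, there is surely
 * a match, so return 1.0.  Otherwise estimate the number of matching rows
 * from the histogram, and clamp the result to 1.0.
 */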


Now, I think that last step is wrong. Firstly, the "Do not bother if histogram weight is smaller than 0.1" rule seems bogus. The his2_weight is the total number of rows represented by the histogram, so surely it can't be less than 1. It can't really be less than the statistics target. Unless maybe the histogram was collected when the table was large and it has since shrunk to contain only a few rows, but that seems like a very bizarre corner case. At least it needs more comments explaining what the test is all about, but I think we should just always use the histogram (if it's available).

Secondly, if we estimate that there is on average 1.0 matching row in the table, it does not follow that the probability that at least one row matches is 1.0. Assuming a Gaussian distribution with mean 1.0, the probability that at least one row matches is 0.5. Assuming a Gaussian distribution here isn't quite right - I guess a Poisson distribution would be more accurate - but it sure doesn't seem right as it is.

The error isn't very big, and perhaps you don't run into that very often, so I'm not sure what the best way to fix it would be. My statistics skills are a bit rusty, but I think the appropriate way would be to apply the Poisson distribution, with the estimated number of matched rows as the mean. The probability of at least one match is then one minus the probability of zero matches, i.e. 1 - exp(-mean). It sounds like overkill if this case occurs only rarely. But then again, perhaps it's not all that rare.
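To make that concrete, here's roughly what I have in mind for the body of the histogram branch (an untested sketch using the patch's variable names, plus #include <math.h> for exp() if it's not already pulled in):

        Selectivity     his_selec;
        double          matches;

        his_selec = inet_his_inclusion_selec(his2_values, his2_nvalues,
                                             constvalue, opr_order);

        /*
         * his2_weight * his_selec is the expected number of rows matching
         * the constant.  Under a Poisson assumption the probability that
         * at least one of them matches is 1 - e^(-expected).
         */
        matches = his2_weight * his_selec;

        if (matches > 0)
                return 1.0 - exp(-matches);

With an expected 1.0 matching rows that gives about 0.63 instead of 1.0.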

That said, I can't immediately find a test case where that error would matter. I tried this:

create table inettbl1 (a inet);
insert into inettbl1 select ('10.0.0.' || (g % 255))::inet from generate_series(1, 1000) g;
analyze inettbl1;
explain analyze select count(*) from inettbl1 where a >>= ANY (SELECT a from inettbl1);

The estimate for that is pretty accurate, 833 rows estimated vs 1000 actual, with the current patch. I'm afraid if we fixed inet_semi_join_selec the way I suggest, the estimate would be smaller, i.e. more wrong. Is there something else in the estimates that accidentally compensates for this currently?

Thoughts?

- Heikki


