On 09/07/2014 07:09 PM, Emre Hasegeli wrote:
>>> I updated the patch to cover semi and anti joins with eqjoinsel_semi().
>>> I think it is better than returning a constant.
>>
>> What you did there is utterly unacceptable from a modularity standpoint;
>> and considering that the values will be nowhere near right, the argument
>> that "it's better than returning a constant" seems pretty weak.  I think
>> you should just take that out again.
>
> I will try to come up with a better, data type specific implementation
> in a week.
>
> New version with semi join estimation function attached.

Thanks. Overall, my impression of this patch is that it works very well. But damned if I understood *how* it works :-). There's a lot of statistics involved, and it's not easy to see why something is multiplied by something else. I'm adding comments as I read through it.

I've gotten to the inet_semi_join_selec function:

/*
 * Inet semi join selectivity estimation.
 */
static Selectivity
inet_semi_join_selec(bool mcv2_exists, Datum *mcv2_values, int mcv2_nvalues,
                     bool his2_exists, Datum *his2_values, int his2_nvalues,
                     double his2_weight, Datum *constvalue,
                     FmgrInfo *proc, short opr_order)
{
        if (mcv2_exists)
        {
                int                     i;

                for (i = 0; i < mcv2_nvalues; i++)
                {
                        if (DatumGetBool(FunctionCall2Coll(proc, DEFAULT_COLLATION_OID,
                                                           *constvalue, mcv2_values[i])))
                                return 1.0;
                }
        }

        /* Do not bother if histogram weight is smaller than 0.1. */
        if (his2_exists && his2_weight > 0.1)
        {
                Selectivity     his_selec;

                his_selec = inet_his_inclusion_selec(his2_values, his2_nvalues,
                                                     constvalue, opr_order);

                if (his_selec > 0)
                        return Min(1.0, his2_weight * his_selec);
        }

        return 0.0;
}

This desperately needs a comment at the top of the function explaining what it does. Let me try to explain what I think it does:

This function calculates the probability that there is at least one row in table B that satisfies the "constant op column" qual. The constant is passed as an argument, and for table B, the MCV list and histogram are provided. his2_weight is the total number of rows in B that are covered by the histogram. For example, if the table has 1000 rows, 10% of the rows in the table are in the MCV, and another 10% are NULLs, his2_weight would be 800.

First, we check if the constant matches any of the most common values. If it does, return 1.0, because then there is surely a match.

Next, we use the histogram to estimate the number of rows in the table that match the qual. If it amounts to more than 1 row, we return 1.0. If it's between 0.0 and 1.0 rows, we return that number as the probability.
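Assuming that reading is right (please correct me where it's not), a comment along these lines at the top of the function would already help:

/*
 * Semi join selectivity estimation for the inet operators.
 *
 * Estimates the probability that at least one row of the inner relation
 * satisfies the "constant OP inner column" qual.  The constant comes from
 * the outer side; for the inner side we are given the MCV list and the
 * histogram.  his2_weight is the number of inner rows represented by the
 * histogram, i.e. total rows minus NULLs and MCV members.
 *
 * If the constant matches one of the most common values, there is surely
 * a match, so return 1.0.  Otherwise estimate the number of matching rows
 * from the histogram, and clamp the result to 1.0.
 */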


Now, I think that last step is wrong. Firstly, the "Do not bother if histogram weight is smaller than 0.1" rule seems bogus. The his2_weight is the total number of rows represented by the histogram, so surely it can't be less than 1. It can't really be less than the statistics target. Unless maybe the histogram was collected when the table was large and it has since shrunk to contain only a few rows, but that seems like a very bizarre corner case. At least it needs more comments explaining what the test is all about, but I think we should just always use the histogram (if it's available).

Secondly, if we estimate that there is on average 1.0 matching row in the table, it does not follow that the probability that at least one row matches is 1.0. Assuming a Gaussian distribution with mean 1.0, the probability that at least one row matches is 0.5. Assuming a Gaussian distribution here isn't quite right - I guess a Poisson distribution would be more accurate - but it sure doesn't seem right as it is.

The error isn't very big, and perhaps you don't run into that very often, so I'm not sure what the best way to fix it would be. My statistics skills are a bit rusty, but I think the appropriate way would be to apply the Poisson distribution, with the estimated number of matched rows as the mean. The probability of at least one match is then one minus the probability of zero matches, i.e. 1 - exp(-mean). It sounds like overkill if this case occurs only rarely. But then again, perhaps it's not all that rare.
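To make that concrete, here's roughly what I have in mind for the body of the histogram branch (an untested sketch using the patch's variable names, plus #include <math.h> for exp() if it's not already pulled in):

        Selectivity     his_selec;
        double          matches;

        his_selec = inet_his_inclusion_selec(his2_values, his2_nvalues,
                                             constvalue, opr_order);

        /*
         * his2_weight * his_selec is the expected number of rows matching
         * the constant.  Under a Poisson assumption the probability that
         * at least one of them matches is 1 - e^(-expected).
         */
        matches = his2_weight * his_selec;

        if (matches > 0)
                return 1.0 - exp(-matches);

With an expected 1.0 matching rows that gives about 0.63 instead of 1.0.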

That said, I can't immediately find a test case where that error would matter. I tried this:

create table inettbl1 (a inet);
insert into inettbl1 select ('10.0.0.' || (g % 255))::inet from generate_series(1, 1000) g;
analyze inettbl1;
explain analyze select count(*) from inettbl1 where a >>= ANY (SELECT a from inettbl1);

The estimate for that is pretty accurate, 833 rows estimated vs 1000 actual, with the current patch. I'm afraid if we fixed inet_semi_join_selec the way I suggest, the estimate would be smaller, i.e. more wrong. Is there something else in the estimates that accidentally compensates for this currently?

Thoughts?

- Heikki


