Re: [HACKERS] multivariate statistics (v19)

Dilip Kumar Thu, 26 Jan 2017 01:44:36 -0800

On Thu, Jan 5, 2017 at 3:27 AM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> Thanks. Those plans match my experiments with the TPC-H data set, although
> I've been playing with the smallest scale (1GB).
>
> It's not very difficult to make the estimation error arbitrary large, e.g.
> by using perfectly correlated (identical) columns.


I have done an initial review for ndistint and histogram patches,
there are few review comments.

ndistinct
---------
1. Duplicate statistics:
postgres=# create statistics s with (ndistinct) on (a,c) from t;
2017-01-07 16:21:54.575 IST [63817] ERROR:  duplicate key value
violates unique constraint "pg_mv_statistic_name_index"
2017-01-07 16:21:54.575 IST [63817] DETAIL:  Key (staname,
stanamespace)=(s, 2200) already exists.
2017-01-07 16:21:54.575 IST [63817] STATEMENT:  create statistics s
with (ndistinct) on (a,c) from t;
ERROR:  duplicate key value violates unique constraint
"pg_mv_statistic_name_index"
DETAIL:  Key (staname, stanamespace)=(s, 2200) already exists.

For duplicate statistics, I think we can check the existence of the
statistics and give more meaningful error code something statistics
"s" already exist.

2. Typo
+ /*
+ * Sort the attnums, which makes detecting duplicies somewhat
+ * easier, and it does not hurt (it does not affect the efficiency,
+ * onlike for indexes, for example).
+ */
/onlike/unlike

3. Typo
/*
 * Find attnims of MV stats using the mvoid.
 */
int2vector *
find_mv_attnums(Oid mvoid, Oid *relid)

/attnims/attnums


histograms
--------------
+ if (matches[i] == MVSTATS_MATCH_FULL)
+ s += mvhist->buckets[i]->ntuples;
+ else if (matches[i] == MVSTATS_MATCH_PARTIAL)
+ s += 0.5 * mvhist->buckets[i]->ntuples;

Isn't it will be better that take some percentage of the bucket based
on the number of distinct element for partial matching buckets.


+static int
+update_match_bitmap_histogram(PlannerInfo *root, List *clauses,
+  int2vector *stakeys,
+  MVSerializedHistogram mvhist,
+  int nmatches, char *matches,
+  bool is_or)
+{
+ int i;

For each clause we are processing all the buckets, can't we use some
data structure which can make multi-dimensions information searching
faster.
Something like HTree, RTree, Maybe storing histogram in these formats
will be difficult?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] multivariate statistics (v19)

Reply via email to