On 01/26/2017 10:43 AM, Dilip Kumar wrote:
histograms
--------------
+ if (matches[i] == MVSTATS_MATCH_FULL)
+ s += mvhist->buckets[i]->ntuples;
+ else if (matches[i] == MVSTATS_MATCH_PARTIAL)
+ s += 0.5 * mvhist->buckets[i]->ntuples;
Isn't it will be better that take some percentage of the bucket based
on the number of distinct element for partial matching buckets.
I don't think so, for the same reason why ineq_histogram_selectivity()
in selfuncs.c uses
binfrac = 0.5;
for partial bucket matches - it provides minimum average error. Even if
we knew the number of distinct items in the bucket, we have no idea what
the distribution within the bucket looks like. Maybe 99% of the bucket
are covered by a single distinct value, maybe all the items are squashed
on one side of the bucket, etc.
Moreover we don't really know the number of distinct values in the
bucket - we only know the number of distinct items in the sample, and
only while building the histogram. I don't think it makes much sense to
estimate the number of distinct items in a bucket, because the buckets
contain only very few rows so the estimates would be wildly inaccurate.
+static int
+update_match_bitmap_histogram(PlannerInfo *root, List *clauses,
+ int2vector *stakeys,
+ MVSerializedHistogram mvhist,
+ int nmatches, char *matches,
+ bool is_or)
+{
+ int i;
For each clause we are processing all the buckets, can't we use some
data structure which can make multi-dimensions information searching
faster.
>
No, we're not processing all buckets for each clause. We're' only
processing buckets that were not "ruled out" by preceding clauses.
That's the whole point of the bitmap.
For example for condition (a=1) AND (b=2), the code will first evaluate
(a=1) on all buckets, and then (b=2) but only on buckets where (a=1) was
evaluated as true. Similarly for OR clauses.
>
Something like HTree, RTree, Maybe storing histogram in these formats
will be difficult?
Maybe, but I don't want to do that in the first version. I'm not opposed
to doing that in the future, if we find out the v1 histograms are not
efficient (I don't think we will, based on tests I did while working on
the patch). Support for other histogram implementations is pretty much
why there is 'type' field in the struct.
For now I think we should stick with the simple implementation.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers