On Thu, Mar 08, 2012 at 11:30:52AM -0500, Tom Lane wrote:
> Noah Misch <n...@leadboat.com> writes:
> > On Wed, Mar 07, 2012 at 07:51:42PM -0500, Tom Lane wrote:
> >> On reflection my idea above is wrong; for example assume that we have a
> >> column with 900 arrays of length 1 and 100 arrays of length 2. Going by
> >> what I said, we'd reduce the histogram to {1,2}, which might accurately
> >> capture the set of lengths present but fails to show that 1 is much more
> >> common than 2. However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
> >> would capture the situation perfectly in one-tenth the space that the
> >> current logic does.
>
> > Granted. When the next sample finds 899/101 instead, though, the
> > optimization vanishes.
>
> No, you missed my next point. That example shows that sometimes a
> smaller histogram can represent the situation with zero error, but in
> all cases a smaller histogram can represent the situation with perhaps
> more error than a larger one. Since we already have a defined error
> tolerance, we should try to generate a histogram that is as small as
> possible while still not exceeding the error tolerance.
>
> Now, it might be that doing that is computationally impractical, or
> too complicated to be reasonable. But it seems to me to be worth
> looking into.
Yes, I did miss your point. One characteristic favoring this approach is
its equal applicability to both STATISTIC_KIND_HISTOGRAM and
STATISTIC_KIND_DECHIST.

> > If you want to materially narrow typical statistics, Alexander's
> > proposal looks like the way to go. I'd guess array columns always
> > having DEC <= default_statistics_target are common enough to make that
> > representation the dominant representation, if not the only necessary
> > representation.
>
> Well, I don't want to have two representations; I don't think it's worth
> the complexity. But certainly we could consider switching to a
> different representation if it seems likely to usually be smaller.

Perhaps some heavy array users could provide input: what are some typical
length ranges among arrays in your applications on which you use
"arr && const", "arr @> const", or "arr <@ const" searches?
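
For concreteness, here is a minimal sketch (in Python rather than the
backend's C) of the search Tom describes above: grow an equi-depth
histogram of the sampled array lengths until the distribution it implies
stays within a fixed error tolerance of the sample. The tolerance value,
the CDF-based error metric, and every name below are illustrative
assumptions, not anything in the tree; the real estimator interpolates
within histogram bins, so its error would have to be measured differently.

from bisect import bisect_right

def equi_depth_bounds(sorted_vals, nbounds):
    # Pick nbounds entries at evenly spaced quantiles of the sample.
    n = len(sorted_vals)
    return [sorted_vals[i * (n - 1) // (nbounds - 1)]
            for i in range(nbounds)]

def max_cdf_error(sorted_vals, bounds):
    # Worst gap, over the distinct sample values v, between the empirical
    # CDF and the fraction of histogram entries <= v (a crude stand-in
    # for the estimator's within-bin interpolation).
    n, k = len(sorted_vals), len(bounds)
    return max(abs(bisect_right(sorted_vals, v) / n -
                   bisect_right(bounds, v) / k)
               for v in set(sorted_vals))

def smallest_histogram(vals, tolerance):
    # Return the fewest equi-depth bounds whose error fits the tolerance.
    sorted_vals = sorted(vals)
    for nbounds in range(2, len(sorted_vals) + 1):
        bounds = equi_depth_bounds(sorted_vals, nbounds)
        if max_cdf_error(sorted_vals, bounds) <= tolerance:
            return bounds
    return sorted_vals

# Tom's example: 900 arrays of length 1 and 100 of length 2.
lengths = [1] * 900 + [2] * 100
print(smallest_histogram(lengths, tolerance=0.01))
# -> [1, 1, 1, 1, 1, 1, 1, 1, 1, 2]  (ten entries, zero error)

With a 1% tolerance this reproduces the ten-entry {1,1,1,1,1,1,1,1,1,2}
histogram from the quoted example, and the same ten entries still fit the
tolerance for the 899/101 sample, so the compression does not vanish
there; the open question is whether such a search is cheap enough to run
during ANALYZE.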