All,

One of our clients is having query plan issues with a table that has a continuously advancing timestamp column (i.e. one with default now()). The newest rows, which are the ones most in demand, are always estimated to be fewer than they actually are, or even non-existent. As a result, the user has to analyze the table every hour ... and it's a very large table.
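For illustration, something along these lines (table name and row counts are made up) reproduces the pattern:

    -- hypothetical history table; created_at is normally filled by the default
    CREATE TABLE events (
        id          serial PRIMARY KEY,
        created_at  timestamptz NOT NULL DEFAULT now()
    );
    -- simulate a day of existing history ending an hour ago, then analyze
    INSERT INTO events (created_at)
         SELECT now() - interval '1 hour' - interval '1 day' * random()
           FROM generate_series(1, 1000000);
    ANALYZE events;   -- histogram high bound is frozen at roughly now() - 1 hour
    -- rows keep arriving through the default after the ANALYZE
    INSERT INTO events (created_at)
         SELECT now() FROM generate_series(1, 50000);
    -- the hot query asks for the newest rows
    EXPLAIN SELECT * FROM events
             WHERE created_at > now() - interval '10 minutes';
    -- the constant is above the stored histogram's high bound, so the row
    -- estimate is a drastic underestimate even though ~50,000 rows qualify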
I've seen this in a lot of other databases, both with timestamp columns and with SERIALs -- both of which are very common table structures. From my reading of the planner code, the problem seems to be the histogram bounds ... if a requested value is above the high bound, it's assumed to be extremely uncommon or nonexistent. This leads to bad plans if analyze hasn't been run very recently.

My thoughts on dealing with this intelligently, without a major change to statistics gathering, went along these lines:

1. Add columns to pg_statistic to hold estimates of upper and lower bound growth between analyzes.

2. Every time analyze is run, populate these columns with 1/2 of the proportion of values above or below the previously stored bounds, averaged with the existing value for the new columns (a tiny worked example is below my sig).

3. Use this factor instead of the existing algorithm to calculate the row estimate for out-of-bounds values.

This is obviously a very rough idea, but I wanted to get feedback on the general problem and my approach before going further with it.

Thanks!

--Josh Berkus
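P.S. To make step 2 a bit more concrete, here's the arithmetic I'm picturing, with made-up numbers (nothing like these columns exists in pg_statistic today). Suppose the previously stored growth estimate for the high bound is 0.02, and this analyze finds 10% of the sampled values above the previously stored high bound:

    -- half the out-of-bounds proportion, averaged with the stored estimate
    SELECT (0.02 + 0.10 / 2) / 2 AS new_high_bound_growth;   -- => 0.035
    -- a later query with a constant above the stored high bound would then be
    -- estimated using this 0.035 factor, instead of the near-zero figure the
    -- histogram-bound logic produces today (step 3)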