Statistics improvements for time series data

Mark Dilger Sun, 08 Dec 2019 16:13:05 -0800

Hackers,

I'm thinking of submitting a patch, and would like to review my
design ideas with you all before doing so.  I've thought about
this problem before, but I can't find any email where I might
have already proposed this.  If I did, and this is a duplicate,
please forgive me.  I'm not trying to resubmit the idea if it
has already been rejected.


I have been considering applications that maintain some current
state information in the face of frequently inserted time series
data.  Assume you loop as follows, waiting for each new data file
coming out of a buffered application:

  COPY sometable FROM recent_timeseries_data.csv;
  SELECT some_aggregate_statistics
    FROM sometable st, another_table, and_another, ...
    WHERE st.sometimefield > pretty_recent AND ...
    GROUP BY some_aggregate_columns;

If you don't control how frequently that file gets written, you
could get lots of really short files frequently, or fewer larger
files less often.

Even with autovacuum set pretty aggressively on "sometable", you
are likely to get a bad plan for the SELECT due to the statistics
on "sometable" for the "sometimefield" not taking into account the
most recently inserted rows.  Some quick EXPLAIN checking verifies
that the number of rows predicted for a timeframe later than
the newest data as of the most recent ANALYZE will be 1.  (I'd
be interested in counter examples -- I'm just looking at the
results of a quick-and-dirty test.)

Updating that loop to perform an ANALYZE between the COPY and the
SELECT helps, but at the expense of potentially running ANALYZE
too often when the recent_timeseries_data.csv files are short
and frequent.  Using a stored procedure to conditionally run the
analyze seems unnecessarily complicated.

Relying on autovacuum to rescue you from bad plans seems foolishly
optimistic, since it would need to run right between your COPY of
new data and your SELECT over that data.  It is unclear how
autovacuum could be modified to do this for you.  Modifying the
statistics system to be predictive based on the state of affairs
at the last ANALYZE and the number of changes since then seems
more promising, but pretty complicated.  I might return to this
idea in a future patch, but today I'm proposing something simpler.

Would it make sense to add an optional parameter to VACUUM, ANALYZE,
and VACUUM ANALYZE that instructs it to only perform the operation
if autovacuum would do so under the current conditions?  In other
words, to consider the PgStat_StatTabEntry's n_dead_tuples and
n_live_tuples the same way autovacuum would?  Something like:

  ANALYZE sometable IF AND ONLY IF AUTOVACUUM WOULD;

A similar argument can be made for VACUUM, if you are trying to get
the visibility map updated prior to the SELECT so that an index only
scan will be feasible.  As for VACUUM ANALYZE, that has a similar
use case, with the downside that you don't know which thresholds
to use, the ones for vacuum or for analyze.  I think I'd implement
it to run the VACUUM ANALYZE if either condition meets autovacuum's
requirements.  (See autovacuum_vac_scale and autovacuum_anl_scale.)

I think the words "IF AND ONLY IF AUTOVACUUM WOULD" should be
replaced with a single word and added to the grammar where
vacuum_option_elem lists VERBOSE, FREEZE and FULL.  Perhaps
"OPTIONALLY", or "AUTOVACUUMESQUE", though I'm really hoping
somebody has a better suggestion.

In the given example, above, the user would likely set the vacuum
and analyze scale factors to zero and the thresholds to something
they've empirically determined to work well for their purposes.
That might be a problem in practice, given that it also impacts
autovacuum's choices.  Would people prefer that those thresholds
be passed as parameters to the command directly?

  VACUUM sometable OPTIONALLY (vacuum_threshold = 10, vacuum_scale = 0)

and only default to autovacuum's settings when not specified?

--
Mark Dilger

Statistics improvements for time series data

Reply via email to