Hackers, I'm thinking of submitting a patch, and would like to review my design ideas with you all before doing so. I've thought about this problem before, but I can't find any email where I might have already proposed this. If I did, and this is a duplicate, please forgive me. I'm not trying to resubmit the idea if it has already been rejected.
I have been considering applications that maintain some current state information in the face of frequently inserted time series data. Assume you loop as follows, waiting for each new data file coming out of a buffered application: COPY sometable FROM recent_timeseries_data.csv; SELECT some_aggregate_statistics FROM sometable st, another_table, and_another, ... WHERE st.sometimefield > pretty_recent AND ... GROUP BY some_aggregate_columns; If you don't control how frequently that file gets written, you could get lots of really short files frequently, or fewer larger files less often. Even with autovacuum set pretty aggressively on "sometable", you are likely to get a bad plan for the SELECT due to the statistics on "sometable" for the "sometimefield" not taking into account the most recently inserted rows. Some quick EXPLAIN checking verifies that the number of rows predicted for a timeframe later than the newest data as of the most recent ANALYZE will be 1. (I'd be interested in counter examples -- I'm just looking at the results of a quick-and-dirty test.) Updating that loop to perform an ANALYZE between the COPY and the SELECT helps, but at the expense of potentially running ANALYZE too often when the recent_timeseries_data.csv files are short and frequent. Using a stored procedure to conditionally run the analyze seems unnecessarily complicated. Relying on autovacuum to rescue you from bad plans seems foolishly optimistic, since it would need to run right between your COPY of new data and your SELECT over that data. It is unclear how autovacuum could be modified to do this for you. Modifying the statistics system to be predictive based on the state of affairs at the last ANALYZE and the number of changes since then seems more promising, but pretty complicated. I might return to this idea in a future patch, but today I'm proposing something simpler. Would it make sense to add an optional parameter to VACUUM, ANALYZE, and VACUUM ANALYZE that instructs it to only perform the operation if autovacuum would do so under the current conditions? In other words, to consider the PgStat_StatTabEntry's n_dead_tuples and n_live_tuples the same way autovacuum would? Something like: ANALYZE sometable IF AND ONLY IF AUTOVACUUM WOULD; A similar argument can be made for VACUUM, if you are trying to get the visibility map updated prior to the SELECT so that an index only scan will be feasible. As for VACUUM ANALYZE, that has a similar use case, with the downside that you don't know which thresholds to use, the ones for vacuum or for analyze. I think I'd implement it to run the VACUUM ANALYZE if either condition meets autovacuum's requirements. (See autovacuum_vac_scale and autovacuum_anl_scale.) I think the words "IF AND ONLY IF AUTOVACUUM WOULD" should be replaced with a single word and added to the grammar where vacuum_option_elem lists VERBOSE, FREEZE and FULL. Perhaps "OPTIONALLY", or "AUTOVACUUMESQUE", though I'm really hoping somebody has a better suggestion. In the given example, above, the user would likely set the vacuum and analyze scale factors to zero and the thresholds to something they've empirically determined to work well for their purposes. That might be a problem in practice, given that it also impacts autovacuum's choices. Would people prefer that those thresholds be passed as parameters to the command directly? VACUUM sometable OPTIONALLY (vacuum_threshold = 10, vacuum_scale = 0) and only default to autovacuum's settings when not specified? -- Mark Dilger