Hi all,

While digging again into the support for pgstats flushes across checkpoints, mainly to avoid autovacuum giving up on tables after a crash if stats entries cannot be found, I have been reminded of this point raised by Heikki:
https://www.postgresql.org/message-id/54fdfc3e-74f6-4ea4-b844-05aa66b39...@iki.fi
The point is quite simple: depending on the timing of the stats file flush, we may end up with orphaned entries, because some objects may be dropped after a checkpoint, leaving behind stats entries that should have been removed. This issue depends on the pgstats file being flushed at checkpoint time, and by design we cannot reach this state currently, as stats are flushed only at shutdown, once all the other backends have exited after the shutdown checkpoint is finished. There is a sanity check for that in pgstat_write_statsfile():
Assert(!ps->dropped);

Up to v14 and 5891c7a8ed, there was a mechanism in place able to do garbage collection of stats entries when vacuuming a database: OIDs of various object types were collected and messages sent to the stats collector to clean things up. With pgstats now in shared memory, we could do something else if we want autovacuum to have some stats post-crash to base its decisions on, rather than giving up. Some ideas I can think of, some of them linked to the timing of the pgstats file flush, but not all:

1) Reimplement the same thing as ~14 in autovacuum workers, with a flush of the pgstats file at each checkpoint and a loop over all the stats kinds that would trigger a periodic cleanup of potential garbage entries, based on a new callback. One option would be a callback returning a set of (stat_kind, objid) pairs for the database being vacuumed, then a loop through these to clean up everything in a last step once all the objects are known. We could also leave that up to each stats kind, dropping the objects it does not know about. This is far from optimal, as there would be a window where we still have orphaned entries: these would not be marked as dropped, yet their OIDs could get reused later. As a whole, it makes the existing pgstats facility *less* robust compared to what we have now.
For tables this could be made efficient by cross-checking the list of table OIDs with the contents of the central dshash once we are done with the initial scan of pg_class, but that's O(N^2) for the pgstats dshash based on the number of entries. I'm not really comfortable reintroducing that, because it does not scale well with many objects, and I know of some users with a lot of.. objects. It also adds more work to autovacuum, which I don't want, as we want autovacuum to be cheaper.

2) The main issue I am trying to tackle is autovacuum giving up on tables when there are no stats entries, so we could add *some* WAL logging of the relation stats that are relevant for autovacuum, then replay them. I think that the correct approach here is to introduce one new RMGR for pgstats, giving each stats kind the possibility to call a routine able to do WAL logging of *some* of its data (custom structure size, custom data size) and to insert records associated with the stats kind. We are going to need a new optional callback, defined by a stats kind, able to force some actions at replay, so that stats kinds can decide what to do with the data in the record. Generation of WAL records has to happen in pgstat_report_stat(), through the flush callback of each stats kind, when the stats stored locally are synced with shared memory. There is an additional reason for doing it there: stats are flushed when backends shut down, and we are still able to generate some WAL at this stage. An advantage of this design is being able to decide which portions of which stats kinds are worth replicating, and we can take a step-by-step approach to what data, and how much of it, we want to replay (for example, for tables we should not care about replicating the number of scans). Another benefit of this design is for custom stats kinds: these can call the pgstats RMGR to pass down some data and define their own callback to use at replay.
If we do that, flushing the stats file at each checkpoint is not actually mandatory: the most critical stats could be in WAL.

3) Do nothing, and accept that we could finish with orphaned entries as a possible scenario? This still requires more thinking about how to deal with orphaned entries, as these would not be marked as dropped, yet their OIDs could be reused after a wraparound. This makes the current assumptions in the pgstats machinery more brittle.

Among all these ideas, 2) is by far the most relevant approach to me, because even if we do not flush pgstats at checkpoint, we can still keep the relevant stats around when performing crash recovery, while copying some stats over to standbys. It should be possible for a given stats kind to take a different action depending on whether we are in standby mode or just in crash recovery. And that would take care of this autovacuum problem post-crash: we would have stats to help decide whether a table should be vacuumed or not.

Note that the implementation can be done in multiple steps, like:
- Add the whole WAL pgstats facility and some tests related to it (WAL logging with injection points for variable and fixed-numbered stats in a custom stats kind).
- Deal with the autovacuum and relation stats part.
- Open the door for more replication of stats data, whatever that may be.

Comments, thoughts or tomatoes?
--
Michael