On Mon, Nov 24, 2025 at 3:37 PM Álvaro Herrera <[email protected]> wrote:
>
> On 2025-Nov-24, Michael Banck wrote:
>
> > In general I doubt how much those gauges (as opposed to counters) only
> > pertaining to the last checkpoint are useful in pg_stat_checkpointer.
> > What would be the use case for those two values?
>
> I think it's useful to know how long checkpoint has to work.  It's a bit
> lame to have only one duration (the last one), but at least with this
> arrangement you can have external monitoring software connect to the
> server, extract that value and save it somewhere else.  Monitoring
> systems do this all the time, and we've been waiting for a better
> implementation to store monitoring data inside Postgres for years.  I
> think we shouldn't block this proposal just because of this issue,
> because it can clearly be useful.
>
> However, I'm not sure I'm very interested in knowing only the duration
> of the checkpoint.  I mean, much of the time the duration is going to be
> whatever fraction of the checkpoint timeout you have as
> checkpoint_completion_target, right?  Which includes sleeps.  So I think
> you really want two durations: one is the duration itself, and the other
> is what fraction of that did the checkpointer sleep in order to achieve
> that duration.  So you know how much time checkpointer spent trying to
> get the operating system to do stuff rather than just sit there waiting.
> We already have that data, kinda, in write_time and sync_time, but those
> are cumulative rather than just for the last one.  (I guess you can have
> the monitoring system compute the deltas as it finds each new
> checkpoint.)  I'm not sure how good this system is.

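For reference, the per-checkpoint deltas mentioned in the quoted text could
be computed on the monitoring side roughly like this. It is only a sketch:
Sample and checkpoint_delta are hypothetical names, and it assumes the
monitoring tool periodically samples the cumulative write_time and sync_time
(in milliseconds) together with a count of completed checkpoints from
pg_stat_checkpointer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    """One reading of the cumulative checkpointer counters (hypothetical type)."""
    checkpoints: int       # assumed: total checkpoints completed so far
    write_time_ms: float   # cumulative write_time
    sync_time_ms: float    # cumulative sync_time


def checkpoint_delta(prev: Sample, curr: Sample) -> Optional[dict]:
    """Return average write/sync time per checkpoint between two samples.

    Returns None when no new checkpoint finished between the samples, or
    when a counter went backwards (for example after a statistics reset).
    """
    n = curr.checkpoints - prev.checkpoints
    write_delta = curr.write_time_ms - prev.write_time_ms
    sync_delta = curr.sync_time_ms - prev.sync_time_ms
    if n <= 0 or write_delta < 0 or sync_delta < 0:
        return None
    return {
        "checkpoints": n,
        "write_ms_per_checkpoint": write_delta / n,
        "sync_ms_per_checkpoint": sync_delta / n,
    }


# Example with made-up numbers: two checkpoints finished between samples.
prev = Sample(checkpoints=10, write_time_ms=1000.0, sync_time_ms=50.0)
curr = Sample(checkpoints=12, write_time_ms=1600.0, sync_time_ms=70.0)
result = checkpoint_delta(prev, curr)
```

If more than one checkpoint completes between two samples, this only yields
an average, which is one reason a last-checkpoint value in the view itself
could still be easier to consume.
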
Thank you for the detailed thoughts. I agree that having only the last
checkpoint's duration is limited, but it still gives monitoring tools a
concrete value they can sample and store over time, which is better than
relying only on counters and logs. I will check whether the total duration
and the active write/sync time (as opposed to sleep time) can be exposed
separately and more clearly, as that seems useful for deeper diagnosis.

> In the past, I looked at a couple of monitoring dashboards offered by
> cloud vendors, searching for anything valuable in terms of checkpoints.
> What I saw was very disappointing -- mostly just "how many checkpoints
> per minute", which is mostly flat zero with periodic spikes.  Totally
> useless.  Does anybody know if some vendor has good charts for this?
> Also, if we were to add this new proposed duration, how could these
> charts improve?

I will look into this in more depth and will let you know if I find
something concrete.

Regards,
Soumya
