On Fri, Feb 17, 2017 at 12:45 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
> Feeling happier about this for now at least.

Thanks!

> I think we need to document how this works more in README or header
> comments. That way I can review it against what it aims to do rather
> than what I think it might do.

I have added a bunch of new comments to explain this in the -v2 patch
(see reply to Abhijit).  Please let me know if you think I need to add
still more.  I'm especially interested in your feedback on the block
of comments above the line:

+    LagTrackerWrite(SendRqstPtr, GetCurrentTimestamp());

Specifically, your feedback on the sufficiency of this (LSN, time)
pair + filtering out repeat LSNs as an approximation of the time this
LSN was flushed.

> e.g. We need to document what replay_lag represents. Does it include
> write_lag and flush_lag, or is it the time since the flush_lag. i.e.
> do I add all 3 together to get the full lag, or would that cause me to
> double count?

I have included full descriptions of exactly what the 3 times
represent in the user documentation in the -v2 patch.

> How sensitive is this? Does the lag spike quickly and then disappear
> again quickly? If we're sampling this every N seconds, will we get a
> realistic viewpoint or just a random sample?

In my testing it seems to move fairly smoothly, so I think sampling
every N seconds would be quite effective and would not be 'noisy'.
The main time it jumps quickly is at the end of a large data load,
when a slow standby finally reaches the end of its backlog; you see it
climb slowly up and up while the faster primary is busy generating WAL
too fast for it to apply, but then if the primary goes idle the
standby eventually catches up.  The high lag number sometimes lingers
for a bit and then pops down to a low number when new WAL arrives that
can be applied quickly.  It seems like a very accurate depiction of
what is really happening, so I like that.  I would love to hear other
opinions and feedback/testing experiences!

> Should we smooth the
> value, or present peak info?

Hmm.
Well, it might be interesting to do online exponential moving
averages, similar to the three numbers Unix systems present for load.
On the other hand, I'm amazed no one has complained that I'm making
pg_stat_replication ridiculously wide already, users/monitoring
systems could easily do that kind of thing themselves, and the number
doesn't seem to be jumpy/noisy/in need of smoothing.  Same would go
for logging over time; seems like an external monitoring tool's
bailiwick.

-- 
Thomas Munro
http://www.enterprisedb.com
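
For anyone who does want smoothed numbers on the monitoring side, here
is a minimal sketch of the online exponential moving average idea
mentioned above.  It is not part of the patch: the names LagAverage,
lag_average_init and lag_average_update are hypothetical, and the
smoothing factor alpha is an assumption that would need tuning to the
sampling interval.  It assumes the caller has already sampled a lag
value (for example replay_lag) and converted it to milliseconds.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical client-side smoother for sampled lag values.
 * Illustration only; nothing like this exists in the patch.
 */
typedef struct LagAverage
{
	double		alpha;			/* smoothing factor, 0 < alpha <= 1 */
	double		value_ms;		/* current smoothed lag in milliseconds */
	bool		initialized;	/* false until the first sample arrives */
} LagAverage;

void
lag_average_init(LagAverage *avg, double alpha)
{
	assert(alpha > 0.0 && alpha <= 1.0);
	avg->alpha = alpha;
	avg->value_ms = 0.0;
	avg->initialized = false;
}

/*
 * Fold one new sample into the average and return the smoothed value.
 * The first sample seeds the average directly, so the result does not
 * start out biased toward zero.
 */
double
lag_average_update(LagAverage *avg, double sample_ms)
{
	if (!avg->initialized)
	{
		avg->value_ms = sample_ms;
		avg->initialized = true;
	}
	else
		avg->value_ms += avg->alpha * (sample_ms - avg->value_ms);

	return avg->value_ms;
}
```

Running three of these side by side with different alpha values would
give something analogous to the 1, 5 and 15 minute load averages,
without adding any more columns to pg_stat_replication.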