I'd like to propose two enhancements to the PostgreSQL code, but I'm not sure 
if this is the correct mailing list.  So if it's not then please let me know 
where I need to post this.

These are monitoring-centric enhancement requests since I'm trying to implement 
accurate monitoring in a secure fashion.

* General monitoring:
We have a need for a "monitoring" role in PostgreSQL that has read-only access 
to any "pg_stat" view.  As of 9.4, only a super-user can read all columns of 
"pg_stat_activity", "pg_stat_replication", and "pg_stat_archiver" (there may be 
other restricted views as well).  These views provide critical insight on how 
well the cluster is operating and what is going on.

There appears to be only two ways to gain access to these views:

  1.  grant super-user to my monitoring user
  2.  write custom functions owned by a super-user with "SECURITY DEFINER" and 
grant access to my monitoring user

Option 1 is too big of a security risk to grant to a monitoring user.  If the 
monitoring system gets compromised then the DB will be at risk too.

Option 2 requires creating, deploying, and maintaining sets of monitoring 
functions, which is a decent chunk of work in a large environment with dozens 
or hundreds of deployments, many running different versions of postgres 
possibly needing custom versions of the functions.  When you add the 
bureaucracy of a large IT organization and SOX and PCI compliance requirements 
it ends up being a PITA implementing or changing these functions when you only 
have a small DBA team.


* Streaming Replication Monitoring:
Make the "pg_stat_replication" view more persistent (maybe keep the rows for 24 
hours or have a registration process?).

There appears to be no way to *accurately* monitor streaming replication via 
SQL alone currently.  This is due to three different problems:

  1.  When a standby gets disconnected from the master then the corresponding 
row is immediately removed from pg_stat_replication on the master.  Once the 
row is gone you cannot tell via simple SQL whether a standby is behind or not 
(or that it even existed at all) without storing prior values and extrapolating 
from them.
  2.  On the standby, if the WAL streaming process gets disconnected from the 
master then it does not report that it is behind because 
pg_last_xlog_receive_location() has not been updated from the master.  The 
standby has no idea how far ahead the master has gotten and just blindly 
reports the last value.
  3.  On a "quiet" system there may not be any update activity on the master 
for a long time, which makes the pg_last_xact_replay_timestamp() function 
report an ever-increasing interval.  So it is not useable for accurately 
measuring lag on quiet systems.

Ideally the master should be able to report standby lag time via SQL, even when 
there has been a disruption in connectivity with a standby.

The only accurate method I have found to measure standby lag is to create a 
synthetic update that runs periodically.  This works, but is less than ideal 
and requires adding a table to every cluster (which then has to be vacuumed 
frequently too) and writing and maintaining a process to update it and purge it.


These two enhancements would go a long way in making it easier to monitor 
PostgreSQL clusters and replication because it would eliminate a lot of custom 
coding requirements and enable us to pull metrics directly via simple SQL.

If anyone can provide insight on how I could accomplish these in a simple 
manner by other means then I'm all ears!

Thanks,
Rob

Reply via email to