Hi all, Is there an "official" pairing of LSN values on the publication and subscription sides that should be used to track the delta between the two systems? I ask because Google is full of different pairs being used. I tried to identify the highest level interface points exposed, i.e. what is documented on https://www.postgresql.org/docs/current/replication-origins.html, the pg_stat_subscription table, the pg_stat_publication table and the pg_current_wal_lsn() function on the publisher, but these seem to be barely used.
Am I missing something? Thanks, Shaheed P.S. On a related note, I see a (stalled?) discussion on providing LSN -> timestamp conversion <https://www.postgresql.org/message-id/flat/CAAKRu_bw7Pgw8Mi9LJrBkFvPPHgvVjPphrT8ugbzs-2V0f%2B1Rw%40mail.gmail.com#8540282228634ecd061585867c6275ca>, I'd just like to say that something like that would be very useful. On Tue, 30 Jan 2024 at 11:27, Shaheed Haque <shaheedha...@gmail.com> wrote: > This is great, thank you for posting. I'm currently a subcontinent or two > away from my dev env, but will compare your approach with mine (you are > using some facilities of psql I'm not familiar with). At least you have > confirmed that LSNs are the place to start. > > Thanks again, Shaheed > > > On Tue, 30 Jan 2024, 05:15 Klaus Darilion, <klaus.mailingli...@pernau.at> > wrote: > >> Hi Saheed! >> >> I monitor our replication this way: >> >> 1. Every 10 seconds i fetch the current LSN and write it into a table, >> next with the current timestamp. Further I fetch confirmend LSNs from >> the replication slots and delete old entries in lsn2data table. >> >> calculate_logical_replication_lag.php: >> >> <?php >> >> $path = realpath(dirname(__FILE__) . "/../inc"); >> set_include_path($path . PATH_SEPARATOR . get_include_path()); >> >> require_once('config.php'); >> $config_int['syslogprefix'] = basename(__FILE__); >> require_once('logging.php'); >> >> $dbuser="replication_lag_user"; >> $dbpass="XXXXXXXXXXXXXXXXXXXX"; >> if (!$dbconn = pg_pconnect('host='.$config_int['dbhost'].' >> dbname='.$config_int['dbname'].' user='.$dbuser.' password='.$dbpass)) { >> print "Sorry, database connection failed"; >> exit; >> } >> >> $accuracy = 10; // in seconds >> >> // >> // Preparations: >> // >> // CREATE TABLE lsn2data( >> // lsn pg_lsn PRIMARY KEY, >> // seen timestamp NOT NULL DEFAULT NOW() >> // ); >> // CREATE ROLE replication_lag_user WITH LOGIN PASSWORD >> 'XXXXXXXXXXXXXXXXXXX'; >> // GRANT ALL ON TABLE lsn2data TO replication_lag_user; >> // >> // CREATE OR REPLACE FUNCTION get_replication_lag() RETURNS TABLE >> (subscriber name, lag bigint) AS >> // $BODY$ >> // DECLARE >> // subscriber name; >> // BEGIN >> // FOR subscriber IN >> // SELECT slot_name FROM pg_replication_slots >> // LOOP >> // RETURN QUERY SELECT slot_name, EXTRACT(EPOCH FROM >> NOW()-seen)::bigint lag from lsn2data,pg_replication_slots WHERE >> slot_name=subscriber AND lsn < confirmed_flush_lsn ORDER BY lsn DESC >> LIMIT 1; >> // END LOOP; >> // RETURN; >> // END >> // $BODY$ >> // LANGUAGE plpgsql; >> // >> while (1) { >> $dbq = pg_query("INSERT INTO lsn2data (lsn) VALUES >> (pg_current_wal_lsn())"); >> if ($dbq === FALSE) { >> mylog(LOG_ERROR, "SQL query error: >> ".pg_last_error()."\n"); >> exit(1); >> } >> >> $dbq = pg_query("DELETE FROM lsn2data WHERE lsn < (". >> "SELECT lsn FROM lsn2data WHERE lsn < (". >> "SELECT confirmed_flush_lsn FROM >> pg_replication_slots ORDER BY confirmed_flush_lsn ASC LIMIT 1". >> ") ORDER BY lsn DESC LIMIT 1". >> ")" >> ); >> if ($dbq === FALSE) { >> mylog(LOG_ERROR, "SQL query error: >> ".pg_last_error()."\n"); >> exit(1); >> } >> sleep($accuracy); >> } >> >> 2. I graph the replications lags (converted from LSN to seconds) in my >> check_mk monitoring: >> >> #!/bin/bash >> >> # >> # Managed by Puppet: >> modules/base/files/monitoring/check_logical_replication_lag.sh >> # >> # Check the logical replication lag and export performance data for each >> subscriber >> # >> >> # exit on error >> #set -e >> >> #Make sure this script only runs one at a time >> ( >> >> ME=$0 >> MEBASE=`basename $0` >> >> mylog () { >> echo "$MEBASE: $1" >> logger -t "$MEBASE" "$1" >> } >> >> flock -x -w 1 200 >> if [ $? != "0" ]; then >> #echo "ERROR: $0 is already running ... exit" >> logger -t "$MEBASE" "ERROR: $0 is already running ... exit" >> exit 1 >> fi >> >> # Do stuff >> >> # Variablen fuer Monitoring >> CMK_SPOOLDIR=/var/lib/check_mk_agent/spool >> CMK_NAME=$MEBASE >> CMK_SPOOLFILE=600_`basename ${CMK_NAME}`.txt >> CMK_HEADER="<<<local>>>" >> TMP_FILE="/tmp/logical_replication_lag.csv" >> >> # Schwellwerte >> warn=300 >> crit=600 >> >> final_output="$CMK_HEADER\nP $CMK_NAME " >> >> # move to a directory where user postgresl may reside (sudo) >> cd /tmp >> >> # Lag auslesen. Waehrend dem initialen aufsynchen eines Subscribers gibt >> es temporaere Subscriptions, mit dem Namen reg_xxx1-pid-sync-pid. >> # Damit diese nicht getrackt werden gibt es die huebsche LIKE Clause. >> rm -f "$TMP_FILE" >> sudo -u postgres psql regdns -c "COPY (SELECT subscriber,lag FROM >> get_replication_lag() WHERE subscriber LIKE '%\_____' ORDER BY 2 DESC) >> TO '$TMP_FILE' With CSV" 2>&1> /dev/null >> LC=$(sudo -u postgres psql -t regdns -c "SELECT count(*) FROM >> get_replication_lag();" | tr -d ' ') >> >> if [ $LC == "0" ]; then >> echo -e "$CMK_HEADER\n0 $CMK_NAME - No Slaves with Replication >> found - maybe we are a slave?" > $CMK_SPOOLDIR/$CMK_SPOOLFILE >> exit 0; >> fi >> >> grep $(hostname | cut -d '-' -f2) "$TMP_FILE" > /dev/null >> if [ $? != "0" ]; then >> echo -e "$CMK_HEADER\n2 $CMK_NAME - Postgres Output does not >> seem valid. Please check script $ME and output in $TMP_FILE" > >> $CMK_SPOOLDIR/$CMK_SPOOLFILE >> exit 1; >> fi >> >> # CSV in Array einlesen >> IFS=$'\n' read -d '' -r -a input_file < "$TMP_FILE" >> >> # Auswerten >> maxlag=0 >> for i in "${input_file[@]}"; do >> node=`echo $i | awk -F "," '{print $1}' | tr -- _ -` >> lag=`echo $i | awk -F "," '{print $2}'` >> final_output="$final_output$node=$lag;$warn;$crit|" >> # >> >> https://unix.stackexchange.com/questions/186663/is-there-a-unix-command-that-gives-the-minimum-maximum-of-two-numbers >> maxlag=$(( maxlag > lag ? maxlag : lag )) >> done >> final_output="${final_output}max-lag=$maxlag;$warn;$crit" >> >> # Letztes Pipe Zeichen rausschneiden >> #final_output=`echo $final_output | rev | cut -c 2- | rev` >> >> # Spool File schreiben >> echo -e $final_output > $CMK_SPOOLDIR/$CMK_SPOOLFILE >> logger -t "$MEBASE" "$final_output" >> >> >> ) 200>/tmp/`basename $0`.exclusivelock >> >> >> 3. During initial sync I check the status on the subscriber. Once it has >> synced all tables of the publication, it will send me an email. >> # >> # Managed by Puppet: >> modules/pdns/templates/check_pglogical_subscription.sh.erb >> # >> >> # >> # This script checks and eventually creates the subscription, and wait >> until the initial sync is finished >> # >> >> PUB=regdns2020_pub >> SLEEP=5 >> PREFIX=check_pglogical_subscription.sh >> NUMTAB=175 >> >> SECONDS=0 >> date >> while true; do >> echo "SELECT * from pg_subscription;" | sudo -u postgres psql -t >> regdns | grep -q $PUB >> if [ $? -eq 0 ]; then >> echo "OK: Host is subscribed to '$PUB'. Checking for >> table count ..." >> break >> fi >> echo "ERROR: Host is not subscribed to '$PUB'. Subscribing to >> master ..." >> logger -t $PREFIX "ERROR: Host is not subscribed to '$PUB'. >> Subscribing to master ..." >> echo "CREATE SUBSCRIPTION `hostname -s|tr -- - _` CONNECTION >> 'host=XXXXX dbname=XXXX user=XXXXX password=XXXXXX PUBLICATION >> regdns2020_pub;" | sudo -u postgres psql regdns && touch >> /etc/regdns.schema_subscription.created >> echo "Re-Checking in $SLEEP seconds ..." >> logger -t $PREFIX "Re-Checking in $SLEEP seconds ..." >> sleep $SLEEP >> done >> >> while true; do >> COUNT=$(echo "SELECT count(*) from pg_subscription_rel;" | sudo >> -u postgres psql -t regdns | head -1 | xargs) >> if [ $COUNT -eq $NUMTAB ]; then >> echo "OK: Subscription '$PUB' contains $NUMTAB tables - >> that is OK. Checking for initial-sync status ..." >> logger -t $PREFIX "OK: Subscription '$PUB' contains >> $NUMTAB tables - that is OK. Checking for initial-sync status ..." >> break >> fi >> echo "ERROR: Subscription '$PUB' contains $COUNT tables, but >> should contain $NUMTAB table. Re-Checking in $SLEEP seconds ..." >> logger -t $PREFIX "ERROR: Subscription '$PUB' contains $COUNT >> tables, but should contain $NUMTAB table. Re-Checking in $SLEEP seconds >> ..." >> sleep $SLEEP >> done >> >> while true; do >> COUNTFIN=$(echo "SELECT count(*) from pg_subscription_rel WHERE >> srsubstate='r';" | sudo -u postgres psql -t regdns | head -1 | xargs) >> if [ $COUNTFIN -eq $NUMTAB ]; then >> echo "OK: Initial sync of $COUNTFIN/$NUMTAB tables >> finished in $SECONDS seconds." >> logger -t $PREFIX "OK: Initial sync of $COUNTFIN/$NUMTAB >> tables finished in $SECONDS seconds." >> echo "OK: Initial sync of $COUNTFIN/$NUMTAB tables >> finished in $SECONDS seconds." | mailx -s "$HOST $SECONDS seconds to >> subscribe" -- root >> break >> fi >> echo "PROGRESS: Initial sync of $COUNTFIN/$NUMTAB tables >> finished. Re-Checking in $SLEEP seconds ..." >> logger -t $PREFIX "PROGRESS: Initial sync of $COUNTFIN/$NUMTAB >> tables finished. Re-Checking in $SLEEP seconds ..." >> sleep $SLEEP >> done >> >> >> >> regards >> Klaus >> >> >> >> >> Am 2023-10-07 17:31, schrieb Shaheed Haque: >> > Hi, >> > >> > I've been playing with logical replication (currently on PG14), >> > specifically in an AWS RDS Postgres context, but NOT using AWS' own >> > replication tooling. I'm generally familiar with the challenges of >> > distributed systems (such causality, time synchronisation etc), but >> > not especially familiar with PG. >> > >> > In looking at how to tell how a given subscriber has caught up with >> > its publisher, there is plenty of advice around the Web, for example >> > >> https://dba.stackexchange.com/questions/314324/monitor-logical-replication-using-lsn >> . >> > Like this example, much advice ends up talking about using separate >> > queries on the publisher and the subscriber to compare LSNs. First, (I >> > think) I understand the core difficulty that comparing LSNs is >> > inherently racy, but given that, I'm a bit unclear as to why a single >> > query on the publisher is not enough...IIUC: >> > >> > * Changes sent from the publisher to the subscriber are >> identified by >> > LSN. >> > * The publisher knows it's own current latest LSN >> > (pg_current_wal_lsn()), but this seems not to be exposed at the >> > subscriber. >> > * The subscriber knows what it has applied locally and even tells >> the >> > publisher (pg_stat_subscription.latest_end_lsn), but it does not seem >> > to be exposed at the publisher. >> > >> > Have I missed something? Is there a way to track the LSN delta (given >> > that this is known to be racy) just by querying one end? >> > >> > Second, how do folk "know" when replication is "done". For example, if >> > the two LSNs continued to match for 1 * replication lag? Or N * >> > replication lag? What would be a plausible N? >> > >> > Third, as we know when logical replication is started, the initial >> > table state is captured in a snapshot, and sent across using COPY >> > TABLE under the covers. Now, let's say that the publisher is idle >> > (i.e. no SQL writes to the user's schema...obviously pg_catalog might >> > change as replication is configured and enabled) and that the >> > replication starts with the publisher as LSN_start. How could one know >> > when the copying is done: >> > >> > * I initially assumed that the publisher's LSN would not change >> from >> > LSN_start, but as the copying proceeds, I see that it DOES change >> > (presumably because there are updates happening to pg_catalog, such as >> > the temporary slots coming and going). >> > * Is there some kind of singleton state on either publisher or >> > subscriber that could be checked to know? (At the moment, I am >> > counting the records in all copied tables). >> > >> > I realise that the knowledge that the publisher is "idle" is a special >> > case, but right now, my test for being "done" is: >> > >> > * Number of records in copied tables matches AND the publisher's >> > pg_stat_subscription matches the subscriber's >> > pg_stat_subscription.latest_end_lsn. >> > >> > Plus or minus the bit about replication lag, is there a better way? >> > >> > Thanks, Shaheed >> >