Hello Hackers,
I recently analyzed an incident where a major lag in synchronous replication blocked a number of synchronous backends. I found myself looking at backends that, according to pg_stat_activity, were neither waiting nor idle, yet they did not finish their work.

As it turns out, the main wait loop for syncrep updates the process title, but is silent within postgres and pg_stat_activity. It seems misleading that committed but waiting backends are shown as 'active' although little is done apart from waiting.

> # select pid, waiting, state, substr(query,1,6) from pg_stat_activity ;
>   pid  | waiting | state  | substr
> -------+---------+--------+--------
>  26294 | f       | active | END;
>  26318 | f       | active | create
>  26323 | f       | active | insert
>  26336 | f       | active | insert
(output of waiting statements [vanilla])

While 'active' is technically correct for a backend that is committed but waiting for replication, in the sense of 'not being available for new tasks', it also implies that the backend is dealing with the issue at hand. The remote host, however, is out of our cluster's control, hence all signs should point to the standby host.

I suggest adding a new state to pg_stat_activity.state for backends that are waiting for their synchronous commit to be flushed on the remote host. I chose 'waiting for synchronous replication' for now.

One should refrain from using the waiting flag at this point, as the backend is not waiting on any internal process. Instead, the backend waits for factors beyond our cluster's control to change.

> # select pid, waiting, state, substr(query,1,6) from pg_stat_activity ;
>  pid  | waiting |                state                | substr
> ------+---------+-------------------------------------+--------
>  3360 | f       | waiting for synchronous replication | END;
>  3465 | f       | waiting for synchronous replication | create
>  3477 | f       | waiting for synchronous replication | insert
>  3489 | f       | waiting for synchronous replication | insert
(output of waiting statements [patched])

patch attached

regards,
Julian Schauder

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e64b7ef..458ae0f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -642,6 +642,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
      </listitem>
      <listitem>
       <para>
+       <literal>waiting for synchronous replication</>: The backend is waiting for its transaction to be flushed on a synchronous standby.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
        <literal>idle</>: The backend is waiting for a new client command.
       </para>
      </listitem>
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 325239d..b6ee1c3 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -45,7 +45,7 @@
 #include "postgres.h"
 
 #include <unistd.h>
-
+#include <pgstat.h>
 #include "access/xact.h"
 #include "miscadmin.h"
 #include "replication/syncrep.h"
@@ -151,6 +151,16 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
 		set_ps_display(new_status, false);
 		new_status[len] = '\0'; /* truncate off " waiting ..." */
 	}
+	/*
+	 * Alter state in pg_stat before entering the loop.
+	 * As with updating the ps display it is safe to assume that we'll wait
+	 * at least for a short time. Hence updating to a waiting state seems
+	 * appropriate even without exactly checking if waiting is required.
+	 * However, we avoid using the waiting-flag at this point as there is
+	 * no lock to wait for.
+	 */
+
+	pgstat_report_activity(STATE_WAITINGFORREPLICATION,NULL);
 
 	/*
 	 * Wait for specified LSN to be confirmed.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f7c9bf6..84d67e0 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -663,6 +663,9 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 					case STATE_IDLEINTRANSACTION_ABORTED:
 						values[4] = CStringGetTextDatum("idle in transaction (aborted)");
 						break;
+					case STATE_WAITINGFORREPLICATION:
+						values[4] = CStringGetTextDatum("waiting for synchronous replication");
+						break;
 					case STATE_DISABLED:
 						values[4] = CStringGetTextDatum("disabled");
 						break;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9ecc163..ab1befc 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -692,6 +692,7 @@ typedef enum BackendState
 	STATE_IDLEINTRANSACTION,
 	STATE_FASTPATH,
 	STATE_IDLEINTRANSACTION_ABORTED,
+	STATE_WAITINGFORREPLICATION,
 	STATE_DISABLED
 } BackendState;
 
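
For readers who want to see the effect of the change without building a patched server, the following is a minimal standalone sketch of the idea. It is not PostgreSQL source. The enum and the state strings follow pgstat.h and the documented pg_stat_activity.state values, plus the state proposed in the patch. ModelBackend, backend_state_name(), model_sync_commit() and state_text_is_syncrep_wait() are hypothetical names made up for this illustration, and the wait loop is only a counter standing in for the real latch wait.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified copy of BackendState, including the proposed new state. */
typedef enum BackendState
{
	STATE_UNDEFINED,
	STATE_IDLE,
	STATE_RUNNING,
	STATE_IDLEINTRANSACTION,
	STATE_FASTPATH,
	STATE_IDLEINTRANSACTION_ABORTED,
	STATE_WAITINGFORREPLICATION,	/* added by the patch */
	STATE_DISABLED
} BackendState;

/* Hypothetical stand-in for a backend's pg_stat_activity entry. */
typedef struct ModelBackend
{
	BackendState state;
} ModelBackend;

/*
 * Hypothetical helper: the text that would appear in
 * pg_stat_activity.state. Returns NULL for STATE_UNDEFINED, which the
 * view shows as SQL NULL.
 */
const char *
backend_state_name(BackendState state)
{
	switch (state)
	{
		case STATE_IDLE:
			return "idle";
		case STATE_RUNNING:
			return "active";
		case STATE_IDLEINTRANSACTION:
			return "idle in transaction";
		case STATE_FASTPATH:
			return "fastpath function call";
		case STATE_IDLEINTRANSACTION_ABORTED:
			return "idle in transaction (aborted)";
		case STATE_WAITINGFORREPLICATION:
			return "waiting for synchronous replication";
		case STATE_DISABLED:
			return "disabled";
		case STATE_UNDEFINED:
			break;
	}
	return NULL;
}

/*
 * Hypothetical model of a synchronous commit. As in the patch, the state
 * is set once before the wait loop and is not reset here; in this model
 * the caller is expected to report the next state afterwards.
 * Returns the number of polls spent "waiting".
 */
int
model_sync_commit(ModelBackend *backend, int polls_until_ack)
{
	int			polls = 0;

	assert(backend != NULL);
	backend->state = STATE_WAITINGFORREPLICATION;
	while (polls < polls_until_ack)
		polls++;				/* stand-in for sleeping on the latch */
	return polls;
}

/* Hypothetical check a monitoring client could apply to the state text. */
bool
state_text_is_syncrep_wait(const char *state_text)
{
	return state_text != NULL &&
		strcmp(state_text, "waiting for synchronous replication") == 0;
}
```

With this model, a backend that was 'active' before the commit reports 'waiting for synchronous replication' as soon as model_sync_commit() has been entered, which is the behaviour a monitoring query against the patched view would key on.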
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers