Hello Hackers,

I recently analyzed an incident in which severe lag in synchronous replication
blocked a number of synchronous backends. I found myself looking at backends
that, according to pg_stat_activity, were neither waiting nor idle, yet did
not finish their work.

As it turns out, the main wait loop for syncrep updates the process title,
but leaves no trace inside postgres or in pg_stat_activity. It seems
misleading that committed but waiting backends are reported as 'active'
although they do little apart from waiting.

> # select pid, waiting, state, substr(query,1,6) from pg_stat_activity ;
>   pid  | waiting | state  | substr
> -------+---------+--------+--------
>  26294 | f       | active | END;
>  26318 | f       | active | create
>  26323 | f       | active | insert
>  26336 | f       | active | insert
(output of waiting statements [vanilla])

While 'active' is technically correct for a backend that has committed but is
waiting for replication, in the sense of 'not being available for new tasks',
it also implies that the backend is working on the task at hand. The remote
host, however, is outside our cluster's control, hence all signs should point
to the standby host.


I suggest adding a new state to pg_stat_activity.state for backends that are
waiting for their synchronous commit to be flushed on the remote host.
I chose 'waiting for synchronous replication' for now.

One should refrain from using the waiting flag at this point, as the backend
is not waiting on anything internal (there is no lock to wait for). Instead,
the backend waits for factors beyond our cluster's control to change.
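
For readers who want to see the mapping without applying the patch, here is a
minimal standalone sketch of the idea. It is not the patched code itself: the
enum is an abridged stand-in for BackendState in pgstat.h, and
backend_state_label() is a hypothetical helper that only mirrors the switch
in pg_stat_get_activity().

```c
#include <stdio.h>

/*
 * Abridged stand-in for the BackendState enum in pgstat.h.
 * Only a few members are listed; STATE_WAITINGFORREPLICATION is the
 * member proposed by the patch.
 */
typedef enum BackendState
{
    STATE_IDLE,
    STATE_RUNNING,
    STATE_IDLEINTRANSACTION_ABORTED,
    STATE_WAITINGFORREPLICATION,
    STATE_DISABLED
} BackendState;

/*
 * Hypothetical helper (not part of PostgreSQL): returns the text that
 * would appear in pg_stat_activity.state for a given backend state.
 */
static const char *
backend_state_label(BackendState state)
{
    switch (state)
    {
        case STATE_IDLE:
            return "idle";
        case STATE_RUNNING:
            return "active";
        case STATE_IDLEINTRANSACTION_ABORTED:
            return "idle in transaction (aborted)";
        case STATE_WAITINGFORREPLICATION:
            return "waiting for synchronous replication";
        case STATE_DISABLED:
            return "disabled";
    }
    return "unknown";
}
```

Calling backend_state_label(STATE_WAITINGFORREPLICATION) yields the string
shown in the patched output; the waiting flag is deliberately not part of
this mapping.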


> # select pid, waiting, state, substr(query,1,6) from pg_stat_activity ;
>  pid  | waiting |                state                | substr
> ------+---------+-------------------------------------+--------
>  3360 | f       | waiting for synchronous replication | END;
>  3465 | f       | waiting for synchronous replication | create
>  3477 | f       | waiting for synchronous replication | insert
>  3489 | f       | waiting for synchronous replication | insert
(output of waiting statements [patched])


patch attached


regards,

Julian Schauder
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e64b7ef..458ae0f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -642,6 +642,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          </listitem>
          <listitem>
           <para>
+           <literal>waiting for synchronous replication</>: The backend is waiting for its transaction to be flushed on a synchronous standby.
+          </para>
+         </listitem>
+         <listitem>
+          <para>
            <literal>idle</>: The backend is waiting for a new client command.
           </para>
          </listitem>
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 325239d..b6ee1c3 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -45,7 +45,7 @@
 #include "postgres.h"
 
 #include <unistd.h>
-
+#include <pgstat.h>
 #include "access/xact.h"
 #include "miscadmin.h"
 #include "replication/syncrep.h"
@@ -151,6 +151,16 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
                set_ps_display(new_status, false);
                new_status[len] = '\0'; /* truncate off " waiting ..." */
        }
+       /*
+        * Alter state in pg_stat before entering the loop.
+        * As with updating the ps display it is safe to assume that we'll wait
+        * at least for a short time. Hence updating to a waiting state seems
+        * appropriate even without exactly checking if waiting is required.
+        * However, we avoid using the waiting-flag at this point as there is
+        * no lock to wait for.
+        */
+
+       pgstat_report_activity(STATE_WAITINGFORREPLICATION, NULL);
 
        /*
         * Wait for specified LSN to be confirmed.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f7c9bf6..84d67e0 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -663,6 +663,9 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
                                case STATE_IDLEINTRANSACTION_ABORTED:
                                        values[4] = CStringGetTextDatum("idle in transaction (aborted)");
                                        break;
+                               case STATE_WAITINGFORREPLICATION:
+                                       values[4] = CStringGetTextDatum("waiting for synchronous replication");
+                                       break;
                                case STATE_DISABLED:
                                        values[4] = CStringGetTextDatum("disabled");
                                        break;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9ecc163..ab1befc 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -692,6 +692,7 @@ typedef enum BackendState
        STATE_IDLEINTRANSACTION,
        STATE_FASTPATH,
        STATE_IDLEINTRANSACTION_ABORTED,
+       STATE_WAITINGFORREPLICATION,
        STATE_DISABLED
 } BackendState;
 
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
