Re: Failed transaction statistics to measure the logical replication progress

Greg Nancarrow Wed, 03 Nov 2021 17:54:06 -0700

On Tue, Nov 2, 2021 at 12:18 AM [email protected]
<[email protected]> wrote:
>
> On Thursday, October 28, 2021 11:19 PM I wrote:
> > I've created a new patch that extends pg_stat_subscription_workers to 
> > include
> > other transaction statistics.
> >
> > Note that this patch depends on v18 patch-set in [1]...
> Rebased based on the v19 in [1].
> Also, fixed documentation a little and made tests tidy.
> FYI, the newly included TAP test(027_worker_xact_stats.pl) is stable
> because I checked that 100 times of its execution in a tight loop all passed.
>


I have done some basic testing of the patch and have some initial
review comments:

(1) I think this patch needs to be in "git format-patch" format, with
a proper commit message that describes the purpose of the patch and
the functionality it adds, and any high-level design points (something
like the overview given in the initial post, and accounting for the
subsequent discussion points and updated functionality).

(2) doc/src/sgml/monitoring.sgml
There are some grammatical issues in the current description. I
suggest changing it to something like:

BEFORE:
+      <entry>At least one row per subscription, showing about
transaction statistics and error summary that
AFTER:
+      <entry>At least one row per subscription, showing transaction
statistics and information about errors that

(2) doc/src/sgml/monitoring.sgml
The current description seems a little confusing.
Per subscription, it shows the transaction statistics and any last
error info from tablesync/apply workers? If this is the case, I'd
suggest the following change:

BEFORE:
+   one row per subscription for transaction statistics and summary of the last
+   error reported by workers applying logical replication changes and workers
+   handling the initial data copy of the subscribed tables.
AFTER:
+   one row per subscription, showing corresponding transaction statistics and
+   information about the last error reported by workers applying
logical replication
+   changes or by workers handling the initial data copy of the
subscribed tables.

(3) xact_commit
I think that the "xact_commit" column should be named
"xact_commit_count" or "xact_commits".
Similarly, I think "xact_error" should be named "xact_error_count" or
"xact_errors", and "xact_aborts" should be named "xact_abort_count" or
"xact_aborts".

(4) xact_commit_bytes

+       Amount of transactions data successfully applied in this subscription.
+       Consumed memory for xact_commit is displayed.

I find this description a bit confusing. "Consumed memory for
xact_commit" seems different to "transactions data".
Could the description be something like: Amount of data (in bytes)
successfully applied in this subscription, across "xact_commit_count"
transactions.

(5)
I'd suggest some minor rewording for the following:

BEFORE:
+       Number of transactions failed to be applied and caught by table
+       sync worker or main apply worker in this subscription.
AFTER:
+       Number of transactions that failed to be applied by the table
+       sync worker or main apply worker in this subscription.

(6) xact_error_bytes
Again, it's a little confusing referring to "consumed memory" here.
How about rewording this, something like:

BEFORE:
+       Amount of transactions data unsuccessfully applied in this subscription.
+       Consumed memory that past failed transaction used is displayed.
AFTER:
+       Amount of data (in bytes) unsuccessfully applied in this
subscription by the last failed transaction.

(7)
The additional information provided for "xact_abort_bytes" needs some
rewording, something like:

BEFORE:
+       Increase <literal>logical_decoding_work_mem</literal> on the publisher
+       so that it exceeds the size of whole streamed transaction
+       to suppress unnecessary consumed network bandwidth in addition to change
+       in memory of the subscriber, if unexpected amount of streamed
transactions
+       are aborted.
AFTER:
+       In order to suppress unnecessary consumed network bandwidth, increase
+       <literal>logical_decoding_work_mem</literal> on the publisher so that it
+       exceeds the size of the whole streamed transaction, and
additionally increase
+       the available subscriber memory, if an unexpected amount of
streamed transactions
+       are aborted.

(8)
Suggested update:

BEFORE:
+ *  Tell the collector that worker transaction has finished without problem.
AFTER:
+ *  Tell the collector that the worker transaction has successfully completed.

(9) src/backend/postmaster/pgstat.c
I think that the GID copying is unnecessarily copying the whole GID
buffer or using an additional strlen().
It should be changed to use strlcpy() to match other code:

BEFORE:
+ /* get the gid for this two phase operation */
+ if (command == LOGICAL_REP_MSG_PREPARE ||
+   command == LOGICAL_REP_MSG_STREAM_PREPARE)
+   memcpy(msg.m_gid, prepare_data->gid, GIDSIZE);
+ else if (command == LOGICAL_REP_MSG_COMMIT_PREPARED)
+   memcpy(msg.m_gid, commit_data->gid, GIDSIZE);
+ else /* rollback prepared */
+   memcpy(msg.m_gid, rollback_data->gid, GIDSIZE);
AFTER:
+ /* get the gid for this two phase operation */
+ if (command == LOGICAL_REP_MSG_PREPARE ||
+   command == LOGICAL_REP_MSG_STREAM_PREPARE)
+   strlcpy(msg.m_gid, prepare_data->gid, sizeof(msg.m_gid));
+ else if (command == LOGICAL_REP_MSG_COMMIT_PREPARED)
+   strlcpy(msg.m_gid, commit_data->gid, sizeof(msg.m_gid));
+ else /* rollback prepared */
+   strlcpy(msg.m_gid, rollback_data->gid, sizeof(msg.m_gid));


BEFORE:
+ strlcpy(prepared_txn->gid, msg->m_gid, strlen(msg->m_gid) + 1);
AFTER:
+ strlcpy(prepared_txn->gid, msg->m_gid, sizeof(prepared_txn->gid));

BEFORE:
+ memcpy(key.gid, msg->m_gid, strlen(msg->m_gid));
AFTER:
+ strlcpy(key.gid, msg->m_gid, sizeof(key.gid));

BEFORE:
+ memcpy(key.gid, gid, strlen(gid));
AFTER:
+ strlcpy(key.gid, gid, sizeof(key.gid));

(10) src/backend/replication/logical/worker.c
Some suggested rewording:

BEFORE:
+ * size of streaming transaction resources because it have used the
AFTER:
+ * size of streaming transaction resources because it has used the


BEFORE:
+ * tradeoff should not be good. Also, add multiple values
+ * at once in order to reduce the number of this function call.
AFTER:
+ * tradeoff would not be good. Also, add multiple values
+ * at once in order to reduce the number of calls to this function.

(11) update_apply_change_size()
Shouldn't this function be declared static?

(12) stream_write_change()

+ streamed_entry->xact_size = streamed_entry->xact_size + total_len;
/* update */

could be simply written as:

+ streamed_entry->xact_size += total_len; /* update */


Regards,
Greg Nancarrow
Fujitsu Australia

Re: Failed transaction statistics to measure the logical replication progress

Reply via email to