On Mon, 7 Sep 2020 at 15:24, Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Thu, Jul 23, 2020 at 11:46 AM Masahiko Sawada > <masahiko.saw...@2ndquadrant.com> wrote: > > > > I've updated the patch so that the stats collector ignores the > > 'update' message if the slot stats array is already full. > > > > This patch needs a rebase. I don't see this patch in the CF app. I > hope you are still interested in working on this.
Thank you for reviewing this patch! I'm still going to work on this patch although I might be slow response this month. > > Review comments: > =============== > 1. > +CREATE VIEW pg_stat_replication_slots AS > + SELECT > + s.name, > + s.spill_txns, > + s.spill_count, > + s.spill_bytes > + FROM pg_stat_get_replication_slots() AS s; > > The view pg_stat_replication_slots should have a column 'stats_reset' > (datatype: timestamp with time zone) as we provide a facitily to reset > the slots. A similar column exists in pg_stat_slru as well, so is > there a reason of not providing it here? I had missed adding the column. Fixed. > > 2. > + </para> > + </sect2> > + > + <sect2 id="monitoring-pg-stat-wal-receiver-view"> > <title><structname>pg_stat_wal_receiver</structname></title> > > It is better to keep one empty line between </para> and </sect2> to > keep it consistent with the documentation of other views. Fixed. > > 3. > <primary>pg_stat_reset_replication_slot</primary> > + </indexterm> > + <function>pg_stat_reset_replication_slot</function> ( > <type>text</type> ) > + <returnvalue>void</returnvalue> > + </para> > + <para> > + Resets statistics to zero for a single replication slot, or for all > + replication slots in the cluster. If the argument is NULL, > all counters > + shown in the > <structname>pg_stat_replication_slots</structname> view for > + all replication slots are reset. > + </para> > > I think the information about the parameter for this function is not > completely clear. It seems to me that it should be the name of the > slot for which we want to reset the stats, if so, let's try to be > clear. Fixed. > > 4. > +pgstat_reset_replslot_counter(const char *name) > +{ > + PgStat_MsgResetreplslotcounter msg; > + > + if (pgStatSock == PGINVALID_SOCKET) > + return; > + > + pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETREPLSLOTCOUNTER); > + if (name) > + { > + memcpy(&msg.m_slotname, name, NAMEDATALEN); > + msg.clearall = false; > + } > > Don't we want to verify here or in the caller of this function whether > the passed slot_name is a valid slot or not? For ex. see > pgstat_reset_shared_counters where we return an error if the target is > not valid. Agreed. Fixed. > > 5. > +static void > +pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg, > + int len) > +{ > + int i; > + int idx = -1; > + TimestampTz ts; > + > + if (!msg->clearall) > + { > + /* Get the index of replication slot statistics to reset */ > + idx = pgstat_replslot_index(msg->m_slotname, false); > + > + if (idx < 0) > + return; /* not found */ > > Can we add a comment to describe when we don't expect to find the slot > here unless there is no way that can happen? Added. > > 6. > +pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg, > + int len) > { > .. > + for (i = 0; i < SLRU_NUM_ELEMENTS; i++) > .. > } > > I think here we need to traverse till nReplSlotStats, not SLRU_NUM_ELEMENTS. Fixed. > > 7. Don't we need to change PGSTAT_FILE_FORMAT_ID for this patch? We > can probably do at the end but better to change it now so that it > doesn't slip from our mind. Yes, changed. > > 8. > @@ -5350,6 +5474,23 @@ pgstat_read_statsfiles(Oid onlydb, bool > permanent, bool deep) > > break; > > + /* > + * 'R' A PgStat_ReplSlotStats struct describing a replication slot > + * follows. > + */ > + case 'R': > + if (fread(&replSlotStats[nReplSlotStats], 1, > sizeof(PgStat_ReplSlotStats), fpin) > + != sizeof(PgStat_ReplSlotStats)) > + { > + ereport(pgStatRunningInCollector ? LOG : WARNING, > + (errmsg("corrupted statistics file \"%s\"", > + statfile))); > + memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats)); > + goto done; > + } > + nReplSlotStats++; > + break; > > Both here and in pgstat_read_db_statsfile_timestamp(), the patch > handles 'R' message after 'D' whereas while writing the 'R' is written > before 'D'. So, I think it is better if we keep the order during read > the same as during write. Changed the code so that it writes 'R' after 'D'. > > 9. While reviewing this patch, I noticed that in > pgstat_read_db_statsfile_timestamp(), if we fail to read ArchiverStats > or SLRUStats, we return 'false' from this function but OTOH, if we > fail to read 'D' or 'R' message, we will return 'true'. I feel the > handling of 'D' and 'R' message is fine because once we read > GlobalStats, we can return the stats_timestamp. So the other two > stands corrected. I understand that this is not directly related to > this patch but if you agree we can do this as a separate patch. It seems to make sense to me. We can set *ts and then read both ArchiverStats and SLRUStats so we can return a valid timestamp even if we fail to read. I've attached both patches: 0001 patch fixes the issue you reported. 0002 patch is the patch that incorporated all review comments. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
0002-Track-statistics-for-spilling-of-changes-from-Reorde.patch
Description: Binary data
0001-Return-true-when-failing-to-read-SLRU-or-Archiver-st.patch
Description: Binary data