On Tue, Sep 8, 2020 at 7:53 AM Masahiko Sawada <masahiko.saw...@2ndquadrant.com> wrote: > > On Mon, 7 Sep 2020 at 15:24, Amit Kapila <amit.kapil...@gmail.com> wrote: > > > > This patch needs a rebase. I don't see this patch in the CF app. I > > hope you are still interested in working on this. > > Thank you for reviewing this patch! > > I'm still going to work on this patch although I might be slow > response this month. >
Comments on the latest patch: ============================= 1. +CREATE VIEW pg_stat_replication_slots AS + SELECT + s.name, + s.spill_txns, + s.spill_count, + s.spill_bytes, + s.stats_reset + FROM pg_stat_get_replication_slots() AS s; You forgot to update the docs for the new parameter. 2. @@ -5187,6 +5305,12 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep) for (i = 0; i < SLRU_NUM_ELEMENTS; i++) slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp; + /* + * Set the same reset timestamp for all replication slots too. + */ + for (i = 0; i < max_replication_slots; i++) + replSlotStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp; + I don't understand why you have removed the above code from the new version of the patch? 3. pgstat_recv_resetreplslotcounter() { .. + ts = GetCurrentTimestamp(); + for (i = 0; i < nReplSlotStats; i++) + { + /* reset entry with the given index, or all entries */ + if (msg->clearall || idx == i) + { + /* reset only counters. Don't clear slot name */ + replSlotStats[i].spill_txns = 0; + replSlotStats[i].spill_count = 0; + replSlotStats[i].spill_bytes = 0; + replSlotStats[i].stat_reset_timestamp = ts; + } + } .. I don't like this coding pattern as in the worst case we need to traverse all the slots to reset a particular slot. This could be okay for a fixed number of elements as we have in SLRU but here it appears quite inefficient. We can move the reset of stats part to a separate function and then invoke it from the place where we need to reset a particular slot and the above place. 4. +pgstat_replslot_index(const char *name, bool create_it) { .. + replSlotStats[nReplSlotStats].stat_reset_timestamp = GetCurrentTimestamp(); .. } Why do we need to set the reset timestamp on the creation of slot entry? 5. @@ -3170,6 +3175,13 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) spilled++; } + /* update the statistics */ + rb->spillCount += 1; + rb->spillBytes += size; + + /* Don't consider already serialized transactions. */ + rb->spillTxns += rbtxn_is_serialized(txn) ? 0 : 1; We can't increment the spillTxns in the above way because now sometimes we do serialize before streaming and in that case we clear the serialized flag after streaming, see ReorderBufferTruncateTXN. So, the count can go wrong. Another problem is currently the patch call UpdateSpillStats only from begin_cb_wrapper which means it won't consider streaming transactions (streaming transactions that might have spilled). If we consider the streamed case along with it, we can probably keep this counter up-to-date because in the above place we can check if the txn is once serialized or streamed, we shouldn't increment the counter. I think we need to merge Ajin's patch for streaming stats [1] and fix the issue. I have not checked his patch so it might need a rebase and or some changes. 6. @@ -322,6 +321,9 @@ ReplicationSlotCreate(const char *name, bool db_specific, /* Let everybody know we've modified this slot */ ConditionVariableBroadcast(&slot->active_cv); + + /* Create statistics entry for the new slot */ + pgstat_report_replslot(NameStr(slot->data.name), 0, 0, 0); } .. .. @@ -683,6 +685,18 @@ ReplicationSlotDropPtr(ReplicationSlot *slot) ereport(WARNING, (errmsg("could not remove directory \"%s\"", tmppath))); + /* + * Report to drop the replication slot to stats collector. Since there + * is no guarantee the order of message arrival on an UDP connection, + * it's possible that a message for creating a new slot arrives before a + * message for removing the old slot. We send the drop message while + * holding ReplicationSlotAllocationLock to reduce that possibility. + * If the messages arrived in reverse, we would lose one statistics update + * message. But the next update message will create the statistics for + * the replication slot. + */ + pgstat_report_replslot_drop(NameStr(slot->data.name)); + Similar to drop message, why don't we send the create message while holding the ReplicationSlotAllocationLock? [1] - https://www.postgresql.org/message-id/CAFPTHDZ8RnOovefzB%2BOMoRxLSD404WRLqWBUHe6bWqM5ew1bNA%40mail.gmail.com -- With Regards, Amit Kapila.