On Thu, Apr 8, 2021 at 7:30 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Wed, Apr 7, 2021 at 2:51 PM vignesh C <vignes...@gmail.com> wrote: > > > > @@ -4069,6 +4069,24 @@ pgstat_read_statsfiles(Oid onlydb, bool > permanent, bool deep) > * slot follows. > */ > case 'R': > + /* > + * There is a remote scenario where one of the replication slots > + * is dropped and the drop slot statistics message is not > + * received by the statistic collector process, now if the > + * max_replication_slots is reduced to the actual number of > + * replication slots that are in use and the server is > + * re-started then the statistics process will not be aware of > + * this. To avoid writing beyond the max_replication_slots > + * this replication slot statistic information will be skipped. > + */ > + if (max_replication_slots == nReplSlotStats) > + { > + ereport(pgStatRunningInCollector ? LOG : WARNING, > + (errmsg("skipping \"%s\" replication slot statistics as > pg_stat_replication_slots does not have enough slots", > + NameStr(replSlotStats[nReplSlotStats].slotname)))); > + goto done; > + } > > I think we might truncate some valid slots here. I have another idea > to fix this case which is that while writing, we first write the > 'nReplSlotStats' and then write each slot info. Then while reading we > can allocate memory based on the required number of slots. Later when > startup process sends the slots, we can remove the already dropped > slots from this array. What do you think?
IIUC there are two problems in the case where the drop message is lost: 1. Writing beyond the end of replSlotStats. This can happen if after restarting the number of slots whose stats are stored in the stats file exceeds max_replication_slots. Vignesh's patch addresses this problem. 2. The stats for the new slot are not recorded. If the stats for already-dropped slots remain in replSlotStats, the stats for the new slot cannot be registered due to the full of replSlotStats. This can happen even when after restarting the number of slots whose stats are stored in the stat file does NOT exceed max_replication_slots as well as even during the server running. The patch doesn’t address this problem. (If this happens, we will have to reset all slot stats since pg_stat_reset_replication_slot() cannot remove the slot stats with the non-existing name). I think we can use HTAB to store slot stats and have pg_stat_get_replication_slot() inquire about stats by the slot name, resolving both problems. By using HTAB we're no longer concerned about the problem of writing stats beyond the end of the replSlotStats array. Instead, we have to consider how and when to clean up the stats for already-dropped slots. We can have the startup process send slot names at startup time, which borrows the idea proposed by Amit. But maybe we need to consider the case again where the message from the startup process is lost? Another idea would be to have pgstat_vacuum_stat() check the existing slots and call pgstat_report_replslot_drop() if the slot in the stats file doesn't exist. That way, we can continuously check the stats for already-dropped slots. I've written a PoC patch for the above idea; using HTAB and cleaning up slot stats at pgstat_vacuum_stat(). The patch can be applied on top of 0001 patch Vignesh proposed before[1]. Please note that this cannot resolve the problem of ending up accumulating the stats to the old slot if the slot is re-created with the same name and the drop message is lost. To deal with this problem I think we would need to use something unique identifier for each slot instead of slot name. [1] https://www.postgresql.org/message-id/CALDaNm195xL1bZq4VHKt%3D-wmXJ5kC4jxKh7LXK%2BpN7ESFjHO%2Bw%40mail.gmail.com Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
0001-POC-Use-HTAB-for-replication-slot-stats.patch
Description: Binary data