Hi all,

Peter Eisentraut <peter.eisentr...@enterprisedb.com> writes:
> I want to reactivate $subject. I took Petr Jelinek's patch from [0],
> rebased it, added a bit of testing. It basically works, but as
> mentioned in [0], there are various issues to work out.
Thanks for working on that topic, I believe it's an important part of
Postgres' HA story.

> The idea is that the standby runs a background worker to periodically
> fetch replication slot information from the primary. On failover, a
> logical subscriber would then ideally find up-to-date replication slots
> on the new publisher and can just continue normally.

Is there a case to be made for doing the same thing for physical
replication slots too? That's what pg_auto_failover [1] does by default:
it creates replication slots on every node for every other node, so that
a standby Postgres instance also maintains a replication slot for the
primary. This ensures that after a promotion, the standby knows to retain
any and all WAL segments that the old primary might need when rejoining,
at pg_rewind time.

> The previous thread didn't have a lot of discussion, but I have gathered
> from off-line conversations that there is a wider agreement on this
> approach. So the next steps would be to make it more robust and
> configurable and documented.

I suppose part of the configuration would then include taking care of
physical slots. Some people might want to turn that off and instead use
the Postgres 13+ ability to use the remote primary's restore_command to
fetch missing WAL files. Well, people who have set up an archiving
system, anyway.

> As I said, I added a small test case to show that it works at all, but
> I think a lot more tests should be added. I have also found that this
> breaks some seemingly unrelated tests in the recovery test suite. I have
> disabled these here. I'm not sure if the patch actually breaks anything
> or if these are just differences in timing or implementation
> dependencies. This patch adds a LIST_SLOTS replication command, but I
> think this could be replaced with just a SELECT FROM
> pg_replication_slots query now. (This patch is originally older than
> when you could run SELECT queries over the replication protocol.)

Given the admitted state of the patch, I didn't focus on tests. I could
successfully apply the patch on top of the current master branch, compile
it cleanly, and run `make check`. Then I also updated pg_auto_failover to
support Postgres 15devel [2] so that I could `make NODES=3 cluster` there
and play with the new replication command:

  $ psql -d "port=5501 replication=1" -c "LIST_SLOTS;"
  psql:/Users/dim/.psqlrc:24: ERROR:  XX000: cannot execute SQL commands in WAL sender for physical replication
  LOCATION:  exec_replication_command, walsender.c:1830
  ...

I'm not too sure about this idea of running SQL in a replication protocol
connection that you're mentioning, but I suppose that's just me needing
to brush up on the topic.

> So, again, this isn't anywhere near ready, but there is already a lot
> here to gather feedback about how it works, how it should work, how to
> configure it, and how it fits into an overall replication and HA
> architecture.

Maybe the first question about configuration would be about selecting
which slots a standby should maintain from the primary. Is it all of the
slots that exist on both nodes, or a sublist of that? Is it possible to
have a slot with the same name on a primary and a standby node, in a way
that the standby's slot would be a completely separate entity from the
primary's slot? If yes (I just don't know at the moment), well then,
should we continue to allow that?

Other aspects of the configuration might include the list of databases in
which to make the new background worker active, the polling delay, etc.
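To make those questions a bit more concrete, here is a purely
hypothetical sketch of what such standby-side settings could look like.
None of these GUC names exist in the patch as far as I know; they are
only meant to illustrate the knobs being discussed:

  # hypothetical settings, names invented for illustration only
  sync_replication_slots = on              # run the new background worker
  sync_replication_slots_names = '*'       # all slots, or a comma-separated list
  sync_replication_slots_dbnames = 'appdb' # databases where the worker is active
  sync_replication_slots_interval = '10s'  # polling delay against the primary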
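Coming back to the LIST_SLOTS vs. plain SQL point above: my attempt
failed because replication=1 opens a physical walsender, which refuses
SQL commands, as the error says. As far as I know a logical walsender
connection (replication=database) does accept plain SQL, so something
along these lines should be possible in place of a dedicated replication
command (just a sketch, I haven't tried it against the patch):

  $ psql -d "port=5501 dbname=postgres replication=database" \
         -c "SELECT slot_name, slot_type, restart_lsn, confirmed_flush_lsn
               FROM pg_replication_slots;"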
Also, do we want to even consider having the slot management on a primary
node depend on the ability to sync the advancing of those slots to one or
more standby nodes? I'm not sure I see that one as a good idea, but maybe
we want to kill it publicly very early then ;-)

Regards,
--
dim

Author of “The Art of PostgreSQL”, see https://theartofpostgresql.com

[1]: https://github.com/citusdata/pg_auto_failover
[2]: https://github.com/citusdata/pg_auto_failover/pull/838