On Thu, Jan 25, 2018 at 07:38:37AM -0500, Stephen Frost wrote:
> Michael, all,
>
> * Michael Paquier (michael.paqu...@gmail.com) wrote:
>> On Wed, Jan 24, 2018 at 12:43:51PM -0500, Stephen Frost wrote:
>>> * chenhj (chjis...@163.com) wrote:
>>>> At 2018-01-23 09:56:48, "Stephen Frost" <sfr...@snowman.net> wrote:
>>>>> I've only read through the thread to try and understand what's going on
>>>>> and the first thing that comes to mind is that you're changing
>>>>> pg_rewind to not remove the WAL from before the divergence (split)
>>>>> point, but I'm not sure why.  As noted, that WAL isn't needed for
>>>>> anything (it's from before the split, after all), so why keep it?  Is
>>>>> there something in this optimization that depends on the old WAL being
>>>>> there and, if so, what and why?
>>>>
>>>> After running pg_rewind, the first startup of postgres will do crash
>>>> recovery, and crash recovery will begin from the previous redo point
>>>> preceding the divergence.  So the WAL after the redo point and before
>>>> the divergence is needed.
>>>
>>> Right.
>>
>> Most of the time, and particularly since v11 has removed the need to
>> retain more past segments than one completed checkpoint, those segments
>> have fewer chances to be on the source server, further limiting the
>> impact of the patch discussed on this thread.
>
> Good point.
Actually, it is worse than that, because after doing a promotion one has to
issue a checkpoint on the promoted server so as to update its control file
(pg_rewind fetches the new TLI number from the on-disk file), so we can
never ensure that the needed WAL segments will be on the primary.  Hence the
proposed patch loses most of its value, because there will always be a WAL
segment hole anyway.  It can also help in reducing the load of creating new
segments if min_wal_size is large enough on the rewound server, but that is
a narrow benefit.  In short, it really seems to me that we should reject the
approach as proposed, and replace it with something that prevents the
fetching of any WAL segments from the source server.  I think that we could
consider as well removing all WAL segments on the primary from the point
where WAL forked, as only those created between the last checkpoint before
the fork up to the point where WAL forked are useful anyway.  But those are
bonus points to keep a minimal amount of space for the rewound node, as
they finish by being recycled anyway.  For those reasons I think that the
patch should be marked as returned with feedback.

>> Yes, superuser is necessary now.  If we could get to a point where only
>> a replication permission is needed, that would be nice.  Now we could do
>> things differently.  We could have a system role dedicated to pg_rewind
>> which works only on the functions from genfile.c that pg_rewind needs,
>> in order to lift the need for a superuser.
>
> Actually, the other work I'm doing nearby wrt removing the explicit
> superuser() checks in those functions would allow a non-superuser role
> to be created which could be used by pg_rewind.  We could even add an
> explicit 'pg_rewind' default role if we wanted to, but is that the route
> we want to go with this or should we be thinking about changing
> pg_rewind to use the replication protocol and a replication user
> instead..?
> The only reason that it doesn't today is that there isn't an easy way
> for it to do so with the existing replication protocol, as I understand
> it.

Yes.  Actually, with the piece of work we are doing, we don't need to work
more on the replication protocol, and pg_rewind does not require heavy
refactoring either, so from a maintenance point of view we would gain much
more in the long term by using a dedicated role.  If your patch to have the
file-read catalog role gets in, we should add to the docs of pg_rewind a
small set of ACL commands to allow pg_rewind to work with a non-superuser
role, as long as it gains access to the minimal set of functions necessary.

> Then again, there's this whole question of if we should even keep the
> replication protocol or if we should be getting rid of it in favor of
> having regular connections that can support replication.  There's been
> discussion of that recently, as I recall, though I can't remember
> where.

Hm.  I don't recall this one..

> Having the rewound server decide by itself what it needs actually sounds
> like it would be better and would allow whatever the restore command is
> to fetch any missing WAL necessary (which it might be able to do more
> efficiently, from an archive server and possibly even in parallel as
> compared to pg_rewind doing it serially from the primary...).  We don't
> have any particularly easy way to prevent needed WAL from disappearing
> off of the primary anyway, so it seems like it would be necessary to
> have a working restore command for pg_rewind to be reliable.  While we
> could create a physical replication slot for pg_rewind when it starts,
> that may already be too late.

The only approach I can think of that makes an external archive unnecessary
is to have a slot that guarantees that WAL segments down to the previous
checkpoint before promotion are kept around, allowing a rewound node to use
them.  For planned failovers, things can be controlled even today, as you
know when to create a slot.
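As a rough sketch of both points above, the pg_rewind docs could suggest
something like the following.  The role name is hypothetical, the GRANTs
assume the file-read functions have lost their hardcoded superuser checks
(per Stephen's nearby patch), and the function signatures are the genfile.c
ones pg_rewind relies on:

```sql
-- Hypothetical non-superuser role for pg_rewind; minimal function ACLs.
CREATE USER rewind_user LOGIN;
GRANT EXECUTE ON FUNCTION pg_catalog.pg_ls_dir(text, boolean, boolean)
  TO rewind_user;
GRANT EXECUTE ON FUNCTION pg_catalog.pg_stat_file(text, boolean)
  TO rewind_user;
GRANT EXECUTE ON FUNCTION pg_catalog.pg_read_binary_file(text)
  TO rewind_user;
GRANT EXECUTE ON FUNCTION
  pg_catalog.pg_read_binary_file(text, bigint, bigint, boolean)
  TO rewind_user;

-- For a planned failover: create a physical slot on the standby about to
-- be promoted, reserving WAL immediately, so the segments needed to rewind
-- the old primary are retained.
SELECT pg_create_physical_replication_slot('rewind_slot', true);
```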
Unplanned failovers are where the problems come from.  One way I think we
could address them is to have what I would call a failover slot.  The
concept is simple: when a node is in recovery, have it automatically create
a new physical replication slot at each restart point, with a user-defined
name controlled by a GUC.  If the parameter is empty, no slot is created.
At the beginning of a restart point the previous slot is removed and then
recreated; as restart points only happen in recovery, once the node is
promoted the slot will not get removed, and it would automatically keep
around WAL segments down to the last redo point before promotion.  When you
don't want to set up a WAL archive, but still want to ensure that a rewound
node is able to catch up with its promoted primary, just enable this
failover slot and then set primary_slot_name in recovery.conf to use it.
--
Michael
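In configuration terms, the failover-slot proposal above would boil down to
something like this (failover_slot_name is the hypothetical GUC being
proposed here, not an existing parameter; primary_slot_name does exist):

```
# Hypothetical GUC from this proposal: while in recovery, drop and
# recreate this physical slot at each restart point.
failover_slot_name = 'failover_slot'

# Existing recovery.conf parameter: after rewinding the old primary,
# point it at the slot retained on the promoted node.
primary_slot_name = 'failover_slot'
```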