On 11 March 2016 at 20:15, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
> Craig Ringer wrote:
>
> > Hi all
> >
> > I think I found a couple of logical decoding issues while writing tests for
> > failover slots.
> >
> > Despite the docs' claim that a logical slot will replay data "exactly
> > once", a slot's confirmed_lsn can go backwards and the SQL functions can
> > replay the same data more than once. We don't mark a slot as dirty if only
> > its confirmed_lsn is advanced, so it isn't flushed to disk. For failover
> > slots this means it also doesn't get replicated via WAL. After a master
> > crash, or for failover slots after a promote event, the confirmed_lsn will
> > go backwards. Users of the SQL interface must keep track of the safely
> > locally flushed slot position themselves and throw the repeated data away.
> > Unlike with the walsender protocol it has no way to ask the server to skip
> > that data.
> >
> > Worse, because we don't dirty the slot even a *clean shutdown* causes slot
> > confirmed_lsn to go backwards. That's a bug IMO. We should force a flush of
> > all slots at the shutdown checkpoint, whether dirty or not, to address it.
>
> Why don't we mark the slot dirty when confirmed_lsn advances? If we fix
> that, doesn't it fix the other problems too?

Yes, it does.

That'll cause slots to be written out at checkpoints when they otherwise
wouldn't have to be, but I'd rather be doing a little more work in this case.
Compared to the disk activity from WAL decoding etc the effect should be
undetectable anyway.

Andres? Any objection to dirtying a slot when the confirmed lsn advances, so
we write it out at the next checkpoint?

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
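
For illustration only, here is a minimal sketch of the client-side bookkeeping the quoted message describes for users of the SQL interface: remember the position that has been safely flushed locally, and throw away anything the slot replays at or before it. The helper names (parse_lsn, drop_replayed) are hypothetical, and the sketch assumes each item the client receives carries the commit LSN of its transaction as text in PostgreSQL's usual "hi/lo" hexadecimal form. How that commit LSN is obtained depends on the output plugin and is not covered here.

```python
def parse_lsn(text):
    """Convert an LSN written as 'XXXXXXXX/YYYYYYYY' (hex) to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def drop_replayed(transactions, flushed_lsn):
    """Yield only transactions that committed after the locally flushed position.

    transactions: iterable of (commit_lsn_text, payload) pairs, in the order
                  the slot returned them (assumed shape, not a PostgreSQL API).
    flushed_lsn:  integer LSN the client has durably recorded on its own side.
    """
    for commit_lsn_text, payload in transactions:
        if parse_lsn(commit_lsn_text) > flushed_lsn:
            yield commit_lsn_text, payload
        # Anything at or before flushed_lsn was already applied and is
        # repeated data from a slot whose confirmed position went backwards.
```

The client would need to persist flushed_lsn together with the data it applies, so that the two cannot get out of step after a crash on its own side.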