Hi Jeremy,

On 08/28/2018 10:46 PM, Jeremy Finzel wrote:
>     We have hit this error again, and we plan to snapshot the database
>     as to be able to do whatever troubleshooting we can.
>
> I am happy to report that we were able to get replication working again
> by running snapshots of the systems in question on servers running the
> latest point release 9.6.10, and replication simply works and skips over
> these previously erroring relfilenodes. So whatever fixes were made in
> this point release to logical decoding seems to have fixed the issue.
>
Interesting. So you were running 9.6.9 before, it triggered the issue
(and was not able to recover). You took a filesystem snapshot, started
9.6.10 on the snapshot, and it recovered without hitting the issue?

I quickly went through the commits in the 9.6 branch between 9.6.9 and
9.6.10, looking for anything that might be related, and these three
commits seem possibly related (mostly because they touch invalidations,
vacuum, ...):

  6a46aba1cd6dd7c5af5d52111a8157808cbc5e10
  Fix bugs in vacuum of shared rels, by keeping their relcache entries
  current.

  da10d6a8a94eec016fa072d007bced9159a28d39
  Fix "base" snapshot handling in logical decoding

  0a60a291c9a5b8ecdf44cbbfecc4504e3c21ef49
  Add table relcache invalidation to index builds.

But it's hard to say whether any of those commits did the trick, or
which one, without more information.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services