On Mon, Jun 25, 2018 at 12:41 PM, Andres Freund <and...@anarazel.de> wrote:
> Hi,
>
> On 2018-06-25 10:37:18 -0500, Jeremy Finzel wrote:
> > I am hoping someone here can shed some light on this issue - I apologize if
> > this isn't the right place to ask this but I'm almost sure some of you all
> > were involved in pgq's dev and might be able to answer this.
> >
> > We are actually running 2 replication technologies on a few of our dbs,
> > skytools and pglogical. Although we are moving towards only using logical
> > decoding-based replication, right now we have both for different purposes.
> >
> > There seems to be a table rewrite happening on table pgq.event_58_1 that
> > has happened twice, and it ends up in the decoding stream, resulting in the
> > following error:
> >
> > ERROR,XX000,"could not map filenode ""base/16418/1173394526"" to relation OID"
> >
> > In retracing what happened, we discovered that this relfilenode was
> > rewritten. But somehow, it is ending up in the logical decoding stream as
> > is "undecodable". This is pretty disastrous because the only way to fix it
> > really is to advance the replication slot and lose data.
> >
> > The only obvious table rewrite I can find in the pgq codebase is a truncate
> > in pgq.maint_rotate_tables.sql. But there isn't anything surprising
> > there. If anyone has any ideas as to what might cause this so that we
> > could somehow mitigate the possibility of this happening again until we
> > move off pgq, that would be much appreciated.
>
> I suspect the issue might be that pgq does some updates to catalog
> tables. Is that indeed the case?
>

I also suspected this. The only case I found of this is that it is doing
deletes and inserts to pg_autovacuum. I could not find anything quickly
otherwise but I'm not sure if I'm missing something in some of the C code.

Thanks,
Jeremy