On Tue, May 22, 2018 at 2:42 PM, Maxim Boguk <maxim.bo...@gmail.com> wrote:
> On Tue, May 22, 2018 at 10:30 PM, Andres Freund <and...@anarazel.de> wrote:
>
>> Hi,
>>
>> On 2018-05-22 22:18:15 +0300, Maxim Boguk wrote:
>> > On Tue, May 22, 2018 at 9:47 PM, Andres Freund <and...@anarazel.de> wrote:
>> > > > select relfrozenxid from pg_class where relname='pg_authid';
>> > > >  relfrozenxid
>> > > > --------------
>> > > >    2863429136
>>
>> > select txid_current();
>> >  txid_current
>> > --------------
>> >   41995913769
>>
>> So that's an xid of 3341208114, if you leave the epoch out. What's
>> autovacuum_freeze_max_age set to in that cluster?
>
> postgres=# show autovacuum_freeze_max_age;
>  autovacuum_freeze_max_age
> ---------------------------
>  200000000
> (default value I think)
>
>> Can you show pg_controldata output, and
>> relminmxid from that cluster?
>
> postgres@db:~$ /usr/lib/postgresql/9.6/bin/pg_controldata -D /var/lib/postgresql/9.6/main
> pg_control version number:            960
> Catalog version number:               201608131
> Database system identifier:           6469368654711450114
> Database cluster state:               in production
> pg_control last modified:             Tue 22 May 2018 10:20:14 PM MSK
> Latest checkpoint location:           CCB5/F9C37950
> Prior checkpoint location:            CCB0/43F316B0
> Latest checkpoint's REDO location:    CCB1/6706BD88
> Latest checkpoint's REDO WAL file:    000000010000CCB100000067
> Latest checkpoint's TimeLineID:       1
> Latest checkpoint's PrevTimeLineID:   1
> Latest checkpoint's full_page_writes: on
> Latest checkpoint's NextXID:          9:3341161759
> Latest checkpoint's NextOID:          190071899
> Latest checkpoint's NextMultiXactId:  59416233
> Latest checkpoint's NextMultiOffset:  215588532
> Latest checkpoint's oldestXID:        2814548646
> Latest checkpoint's oldestXID's DB:   16400
> Latest checkpoint's oldestActiveXID:  3341161759
> Latest checkpoint's oldestMultiXid:   54264778
> Latest checkpoint's oldestMulti's DB: 16400
> Latest checkpoint's oldestCommitTsXid:2814548646
> Latest checkpoint's newestCommitTsXid:3341161758
> Time of latest checkpoint:            Tue 22 May 2018 10:05:16 PM MSK
> Fake LSN counter for unlogged rels:   0/1
> Minimum recovery ending location:     0/0
> Min recovery ending loc's timeline:   0
> Backup start location:                0/0
> Backup end location:                  0/0
> End-of-backup record required:        no
> wal_level setting:                    replica
> wal_log_hints setting:                on
> max_connections setting:              2000
> max_worker_processes setting:         8
> max_prepared_xacts setting:           0
> max_locks_per_xact setting:           64
> track_commit_timestamp setting:       on
> Maximum data alignment:               8
> Database block size:                  8192
> Blocks per segment of large relation: 131072
> WAL block size:                       8192
> Bytes per WAL segment:                16777216
> Maximum length of identifiers:        64
> Maximum columns in an index:          32
> Maximum size of a TOAST chunk:        1996
> Size of a large-object chunk:         2048
> Date/time type storage:               64-bit integers
> Float4 argument passing:              by value
> Float8 argument passing:              by value
> Data page checksum version:           0
>
> postgres=# select datname,datfrozenxid,datminmxid from pg_database order by datname;
>   datname  | datfrozenxid | datminmxid
> -----------+--------------+------------
>  **        |   2815939794 |   54265194
>  postgres  |   2863429136 |   54280819
>  template0 |   3148297669 |   59342106
>  template1 |   2816765546 |   59261794
>
>> I might be daft here, but it's surely curious that the relfrozenxid from
>> the error and pg_catalog are really different (number of digits):
>> catalog: 2863429136
>> error:    248712603
>>
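For reference, the epoch arithmetic above can be reproduced with a query
along these lines; this is a sketch, not output from the affected cluster.
On 9.6, txid_current() returns epoch * 2^32 + xid as a bigint, so the epoch
and the 32-bit xid can be split out and set next to the age of pg_authid:

  -- split the 64-bit txid into epoch and 32-bit xid
  select t >> 32        as xid_epoch,
         t & 4294967295 as xid_32bit
  from (select txid_current() as t) s;

  -- age of pg_authid's relfrozenxid relative to the current xid
  select relname, relfrozenxid, age(relfrozenxid)
  from pg_class
  where relname = 'pg_authid';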
>> > About gdb bt - it's tricky because it is a mission-critical master db
>> > of a huge project. I'll try to promote a backup replica and check if
>> > the issue persists there, and if yes we will have our playground for a
>> > while, but it will require some time to arrange.
>>
>> You should be ok to just bt that in the running cluster, but I
>> definitely understand if you don't want to do that... I'd appreciate if
>> you set up a playground, because this seems like something that'll
>> reappear.
>
> My gdb kung-fu is exceptionally rusty now, so I'd better play in a sandbox.
> And if the error doesn't exist on a freshly promoted replica, that will
> give us useful info as well.
>
> --
> Maxim Boguk
> Senior Postgresql DBA
> http://dataegret.com/ <http://www.postgresql-consulting.com/>
>
> Phone RU: +7 985 433 0000
> Phone UA: +380 99 143 0000
> Phone AU: +61 45 218 5678
>
> LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
> Skype: maxim.boguk
>
> "Doctor, you advised me not to do this, but why does it still hurt when I
> do it again?"

All -

We just noticed that this exact same problem has now returned for us. It
has exactly the same symptoms as my original post (including the fact that
a snapshot does not have the issue). I am open to all suggestions as to
troubleshooting this further!

Thank you,
Jeremy
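For anyone else comparing notes on this symptom, a generic starting point
(not taken from this thread) is to check which relations in the affected
database carry the oldest relfrozenxid/relminmxid, since those are the ones
anti-wraparound autovacuum will keep returning to:

  -- run per database: relations ordered by xid age, oldest first
  select c.oid::regclass        as relation,
         c.relfrozenxid,
         age(c.relfrozenxid)    as xid_age,
         mxid_age(c.relminmxid) as multixact_age
  from pg_class c
  where c.relkind in ('r', 'm', 't')
  order by age(c.relfrozenxid) desc
  limit 20;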