On Tue, May 28, 2019 at 9:53 AM Adrian Klaver <adrian.kla...@aklaver.com> wrote:
> On 5/27/19 9:59 PM, Tom K wrote:
> > Hey Guys,
> >
> > I'm running Patroni w/ PostgreSQL 10, ETCD, Haproxy and Keepalived on 3
> > RHEL 7.6 VM's. Every now and then the underlying storage crashes, taking
> > out the cluster. On recovery, PostgreSQL tends to come up while other
> > databases just blow up. That is, until now.
>
> The above is one master and two standby servers connected via streaming
> replication, correct?
> Is there WAL archiving set up?

Correct. Master election occurs through Patroni.

WAL level is set to:

wal_level = 'replica'

So no archiving.

> > After the most recent crash 2-3 weeks ago, the cluster is now running
> > into this message, but I'm not able to make heads or tails out of why
> > it's throwing this:
>
> So you have not been able to run the cluster the past 2-3 weeks or is
> that more recent?

Haven't been able to bring this PostgreSQL cluster up (run the cluster) since 2-3 weeks ago. Tried quite a few combinations of options to recover it. No luck.

Had storage failures earlier, even with corrupted OS files, but this PostgreSQL cluster w/ Patroni was able to come up each time without any recovery effort on my part.

> When you refer to history files below are you talking about WAL files or
> something else?
>
> Is this:
>
> "recovery command file "recovery.conf" specified neither
> primary_conninfo nor restore_command"
>
> true?

True. recovery.conf is controlled by Patroni. The contents of this file are the same on all the cluster nodes, with the exception of primary_slot_name:

[root@psql01 postgresql-patroni-etcd]# cat recovery.conf
primary_slot_name = 'postgresql0'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql01 postgresql-patroni-etcd]#

[root@psql02 postgres-backup]# cat recovery.conf
primary_slot_name = 'postgresql1'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql02 postgres-backup]#

[root@psql03 postgresql-patroni-backup]# cat recovery.conf
primary_slot_name = 'postgresql2'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql03 postgresql-patroni-backup]#

I've made a copy of the root postgres directory over to another location so that, while troubleshooting, I can always revert to the state the cluster was in when it first failed.
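On the recovery.conf point: as far as I understand, Patroni normally fills in primary_conninfo on the replicas itself once a leader is known, so I'm assuming it's absent here because no member has been promoted since the crash. For reference, a recovery.conf that actually names a WAL source (and so wouldn't trigger that warning) would look roughly like the sketch below. The host, user, password and archive path are placeholders, not values from this cluster. (I've also left a couple of notes inline below, under the relevant parts of the quoted logs.)

# minimal sketch only; host/user/password/archive path are placeholders
standby_mode = 'on'
primary_slot_name = 'postgresql0'
recovery_target_timeline = 'latest'
# stream WAL from the current primary ...
primary_conninfo = 'host=192.168.0.108 port=5432 user=replicator password=CHANGEME'
# ... and/or restore archived WAL, if archiving were enabled
restore_command = 'cp /path/to/wal_archive/%f %p'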
Thx,
TK

> > -bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf
> > 2019-05-23 08:26:34.561 EDT [10101] LOG:  listening on IPv4 address "192.168.0.108", port 5432
> > 2019-05-23 08:26:34.565 EDT [10101] LOG:  listening on Unix socket "./.s.PGSQL.5432"
> > 2019-05-23 08:26:34.591 EDT [10101] LOG:  redirecting log output to logging collector process
> > 2019-05-23 08:26:34.591 EDT [10101] HINT:  Future log output will appear in directory "log".
> > ^C
> > -bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
> > 2019-05-23 08:26:45.707 EDT [10105] LOG:  listening on IPv4 address "192.168.0.108", port 5432
> > 2019-05-23 08:26:45.715 EDT [10105] LOG:  listening on Unix socket "./.s.PGSQL.5432"
> > 2019-05-23 08:26:45.740 EDT [10105] LOG:  redirecting log output to logging collector process
> > 2019-05-23 08:26:45.740 EDT [10105] HINT:  Future log output will appear in directory "log".
> > ^C
> > -bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.1 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
> > 2019-05-23 08:29:36.014 EDT [10188] LOG:  listening on IPv4 address "192.168.0.108", port 5432
> > 2019-05-23 08:29:36.017 EDT [10188] LOG:  listening on Unix socket "./.s.PGSQL.5432"
> > 2019-05-23 08:29:36.040 EDT [10188] LOG:  redirecting log output to logging collector process
> > 2019-05-23 08:29:36.040 EDT [10188] HINT:  Future log output will appear in directory "log".
> >
> > The database logs only print out this:
> >
> > 2019-05-23 08:25:44.476 EDT [10088] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
> > 2019-05-23 08:25:44.477 EDT [10088] LOG:  entering standby mode
> > 2019-05-23 08:25:50.209 EDT [10086] LOG:  received fast shutdown request
> > 2019-05-23 08:25:50.215 EDT [10089] LOG:  shutting down
> > 2019-05-23 08:25:50.221 EDT [10086] LOG:  database system is shut down
> > 2019-05-23 08:26:11.560 EDT [10094] LOG:  database system was shut down in recovery at 2019-05-23 08:25:50 EDT
> > 2019-05-23 08:26:11.560 EDT [10094] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
> > 2019-05-23 08:26:11.560 EDT [10094] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
> > 2019-05-23 08:26:11.562 EDT [10094] LOG:  entering standby mode
> > 2019-05-23 08:26:17.817 EDT [10092] LOG:  received fast shutdown request
> > 2019-05-23 08:26:17.824 EDT [10095] LOG:  shutting down
> > 2019-05-23 08:26:17.831 EDT [10092] LOG:  database system is shut down
> > 2019-05-23 08:26:34.596 EDT [10103] LOG:  database system was shut down in recovery at 2019-05-23 08:26:17 EDT
> > 2019-05-23 08:26:34.596 EDT [10103] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
> > 2019-05-23 08:26:34.596 EDT [10103] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
> > 2019-05-23 08:26:34.598 EDT [10103] LOG:  entering standby mode
> > 2019-05-23 08:26:43.025 EDT [10101] LOG:  received fast shutdown request
> > 2019-05-23 08:26:43.032 EDT [10104] LOG:  shutting down
> > 2019-05-23 08:26:43.038 EDT [10101] LOG:  database system is shut down
> > 2019-05-23 08:26:45.748 EDT [10107] LOG:  database system was shut down in recovery at 2019-05-23 08:26:43 EDT
> > 2019-05-23 08:26:45.748 EDT [10107] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
> > 2019-05-23 08:26:45.748 EDT [10107] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
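Since recovery.conf names neither primary_conninfo nor restore_command, I gather the standby can only replay whatever is already sitting in pg_wal, hence the endless polling above. A rough sketch of how to see where recovery stands and what it has to work with (the data directory is /data/patroni as above; the grep just filters pg_controldata's output):

/usr/pgsql-10/bin/pg_controldata /data/patroni | grep -Ei 'cluster state|timeline|checkpoint location|minimum recovery'
# what WAL segments and timeline history files are actually on disk
ls -l /data/patroni/pg_wal/

The timeline and checkpoint lines should at least show how far this node got and which timeline it thinks it is on.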
> > and when connecting, I get this cryptic message:
> >
> > [root@psql01 log]# psql -U postgres -h 192.168.0.108
> > psql: FATAL:  the database system is starting up
> > [root@psql01 log]#
> >
> > There's no space issue here. The cluster stays in recovery. This state doesn't change and the DB is inaccessible.
> >
> > So I ran the above in debug mode ( -d 5 ) and got this:
> >
> > DEBUG:  mmap(148897792) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
> >
> > so I freed memory on the underlying physical host this VM sits on, but no luck. Same memory error and same result as above.
> >
> > When Patroni is started on just this one node (for troubleshooting purposes) I get the following:
> >
> > [root@psql01 ~]# ps -ef | grep -Ei "patroni|postgres"
> > root      2217  2188  0 00:38 pts/1    00:00:00 tail -f postgresql-Thu.log
> > postgres  2512     1  4 00:42 ?        00:00:01 /usr/bin/python2 /bin/patroni /etc/patroni.yml
> > postgres  2533     1  0 00:42 ?        00:00:00 /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
> > postgres  2535  2533  0 00:42 ?        00:00:00 postgres: postgres: logger process
> > postgres  2536  2533  0 00:42 ?        00:00:00 postgres: postgres: startup process   waiting for 000000010000000000000008
> > root      2664  2039  0 00:42 pts/0    00:00:00 grep --color=auto -Ei patroni|postgres
> > [root@psql01 ~]#
> >
> > The system prints "postgres: postgres: startup process   waiting for 000000010000000000000008".
> >
> > Seems it's trying to replay the history log but can't? Running strace against the startup process reveals that it's looking for the following:
> >
> > read(5, 0x7fff9cb4eb87, 1)                = -1 EAGAIN (Resource temporarily unavailable)
> > read(5, 0x7fff9cb4eb87, 1)                = -1 EAGAIN (Resource temporarily unavailable)
> > open("pg_wal/00000098.history", O_RDONLY) = -1 ENOENT (No such file or directory)
> > epoll_create1(EPOLL_CLOEXEC)              = 3
> >
> > But this file doesn't exist in the directory on this particular cluster node (psql01). I copy the missing history files off another cluster node to this one and it does move forward, but it just gets stuck on another history file. Same error message as above on psql01.
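My understanding is that these timeline history files are just small text files, one line per timeline switch: parent timeline ID, the LSN where the switch happened, and a free-text reason. A valid one should look roughly like this single made-up line (values purely illustrative):

1	0/9D000098	no recovery target specified

so a quick way to eyeball all of them on the three nodes for garbage is something like:

for f in /data/patroni/pg_wal/*.history; do echo "== $f"; cat "$f"; done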
> > So I give up and try to recover the secondary node (psql02). But this yields this error:
> >
> > PANIC:  replication checkpoint has wrong magic 0 instead of 307747550
> >
> > and I'm not able to find very much to move me forward on this one either.
> >
> > So again I give up after a few hours of trying to recover psql02 and move on to the third node. On the third cluster node (psql03) I get this error, which I'm also not able to find much info on:
> >
> > FATAL:  syntax error in history file: f2W
> >
> > Hoping I'm able to recover the data from at least one of the nodes, make it primary, and get it to replicate to the others.
> >
> > Thx,
> > TK
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com