On Wed, May 29, 2019 at 10:28 AM Adrian Klaver <adrian.kla...@aklaver.com> wrote:
> On 5/28/19 6:59 PM, Tom K wrote:
> > On Tue, May 28, 2019 at 9:53 AM Adrian Klaver <adrian.kla...@aklaver.com> wrote:
> >
> > > > Correct. Master election occurs through Patroni. WAL level is set to:
> > > >
> > > > wal_level = 'replica'
> > > >
> > > > So no archiving.
> > > >
> > > > After the most recent crash 2-3 weeks ago, the cluster is now running
> > > > into this message but I'm not able to make heads or tails out of why
> > > > it's throwing this:
> > >
> > > So you have not been able to run the cluster the past 2-3 weeks or is
> > > that more recent?
> >
> > Haven't been able to bring this PostgreSQL cluster up (run the cluster)
> > since 2-3 weeks ago. Tried quite a few combinations of options to
> > recover this. No luck. Had storage failures earlier, even with
> > corrupted OS files, but this PostgreSQL cluster w/ Patroni was able to
> > come up each time without any recovery effort on my part.
> >
> > > When you refer to history files below are you talking about WAL files or
> > > something else?
> > >
> > > Is this:
> > >
> > > "recovery command file "recovery.conf" specified neither
> > > primary_conninfo nor restore_command"
> > >
> > > true?
> >
> > True. recovery.conf is controlled by Patroni. Contents of this file
> > remained the same for all the cluster nodes with the exception of the
> > primary_slot_name:
> >
> > [root@psql01 postgresql-patroni-etcd]# cat recovery.conf
> > primary_slot_name = 'postgresql0'
> > standby_mode = 'on'
> > recovery_target_timeline = 'latest'
> > [root@psql01 postgresql-patroni-etcd]#
> >
> > [root@psql02 postgres-backup]# cat recovery.conf
> > primary_slot_name = 'postgresql1'
> > standby_mode = 'on'
> > recovery_target_timeline = 'latest'
> > [root@psql02 postgres-backup]#
> >
> > [root@psql03 postgresql-patroni-backup]# cat recovery.conf
> > primary_slot_name = 'postgresql2'
> > standby_mode = 'on'
> > recovery_target_timeline = 'latest'
> > [root@psql03 postgresql-patroni-backup]#
> >
> > I've made a copy of the root postgres directory over to another location
> > so when troubleshooting, I can always revert to the first state the
> > cluster was in when it failed.
>
> I have no experience with Patroni so I will be of no help there. You
> might get more useful information from:
>
> https://github.com/zalando/patroni
> Community
>
> There are two places to connect with the Patroni community: on github,
> via Issues and PRs, and on channel #patroni in the PostgreSQL Slack. If
> you're using Patroni, or just interested, please join us.

Will post there as well. Thank you. My thinking was to post here first since I suspect the Patroni community will simply refer me back here given that the PostgreSQL errors are originating directly from PostgreSQL.

> That being said, can you start the copied Postgres instance without
> using the Patroni instrumentation?

Yes, that is something I have been trying to do actually. But I hit a dead end with the three errors above.
So what I did was copy a single node's backed-up copy of the data files to /data/patroni on the same node (this is the psql data directory as defined through Patroni), then ran this (psql03 = 192.168.0.118):

# sudo su - postgres
$ /usr/pgsql-10/bin/postgres -D /data/patroni \
    --config-file=/data/patroni/postgresql.conf \
    --listen_addresses=192.168.0.118 \
    --max_worker_processes=8 \
    --max_locks_per_transaction=64 \
    --wal_level=replica \
    --track_commit_timestamp=off \
    --max_prepared_transactions=0 \
    --port=5432 \
    --max_replication_slots=10 \
    --max_connections=100 \
    --hot_standby=on \
    --cluster_name=postgres \
    --wal_log_hints=on \
    --max_wal_senders=10 \
    -d 5

This resulted in one of the 3 messages above. Hence the post here. If I can start a single instance, I should be fine since I could then 1) replicate over to the other two or 2) simply take a dump, reinitialize all the databases, then restore the dump.

Using the above procedure I get one of three error messages, depending on which node's data files I use:

[ PSQL01 ]
postgres: postgres: startup process waiting for 000000010000000000000008

[ PSQL02 ]
PANIC: replication checkpoint has wrong magic 0 instead of 307747550

[ PSQL03 ]
FATAL: syntax error in history file: f2W

And I can't start any one of them.

> > Thx,
> > TK
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
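For the PSQL03 message, the text after "history file:" looks like the offending line itself, so it may be worth checking which timeline history file contains it. A minimal sketch, assuming the history files sit under /data/patroni/pg_wal (the usual PostgreSQL 10 layout, not confirmed for this cluster) and that a valid line starts with a numeric timeline ID followed by a location of the form X/X. check_history_file is a made-up helper name, and the checks only approximate what the server's own parser accepts:

```shell
# Hypothetical helper: flag lines in a timeline history file that do not look
# like "<timeline ID> <hex>/<hex> <reason>". Blank lines and lines starting
# with '#' are skipped. Returns non-zero if any suspicious line is found.
check_history_file() {
    awk '
        /^[ \t]*$/ || /^[ \t]*#/ { next }
        $1 !~ /^[0-9]+$/ {
            printf "%s:%d: expected a numeric timeline ID: %s\n", FILENAME, FNR, $0
            bad = 1
            next
        }
        $2 !~ /^[0-9A-Fa-f]+\/[0-9A-Fa-f]+$/ {
            printf "%s:%d: expected a WAL location (X/X): %s\n", FILENAME, FNR, $0
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example use (the path is an assumption):
#   for f in /data/patroni/pg_wal/*.history; do check_history_file "$f"; done
```

A file that fails this check is probably corrupted rather than merely out of date, which would fit the earlier storage failures. Run it against the backed-up copy, not the only remaining copy of the data.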