On Wed, May 29, 2019 at 10:28 AM Adrian Klaver <adrian.kla...@aklaver.com>
wrote:

> On 5/28/19 6:59 PM, Tom K wrote:
> >
> >
> > On Tue, May 28, 2019 at 9:53 AM Adrian Klaver <adrian.kla...@aklaver.com
> > <mailto:adrian.kla...@aklaver.com>> wrote:
> >
>
> >
> > Correct.  Master election occurs through Patroni.  WAL level is set to:
> >
> > wal_level = 'replica'
> >
> > So no archiving.
> >
> >
> >
> >      >
> >      > After the most recent crash 2-3 weeks ago, the cluster is now
> >     running
> >      > into this message but I'm not able to make heads or tails out of
> why
> >      > it's throwing this:
> >
> >     So you have not been able to run the cluster the past 2-3 weeks or is
> >     that  more recent?
> >
> >
> > Haven't been able to bring this PostgresSQL cluster up ( run the cluster
> > ) since 2-3 weeks ago.  Tried quite a few combinations of options to
> > recover this.  No luck.  Had storage failures earlier, even with
> > corrupted OS files, but this PostgreSQL cluster w/ Patroni was able to
> > come up each time without any recovery effort on my part.
> >
> >
> >     When you refer to history files below are you talking about WAL
> >     files or
> >     something else?
> >
> >     Is this:
> >
> >     "recovery command file "recovery.conf" specified neither
> >     primary_conninfo nor restore_command"
> >
> >     true?
> >
> >
> > True. recovery.conf is controlled by Patroni.  Contents of this file
> > remained the same for all the cluster nodes with the exception of the
> > primary_slot_name:
> >
> > [root@psql01 postgresql-patroni-etcd]# cat recovery.conf
> > primary_slot_name = 'postgresql0'
> > standby_mode = 'on'
> > recovery_target_timeline = 'latest'
> > [root@psql01 postgresql-patroni-etcd]#
> >
> > [root@psql02 postgres-backup]# cat recovery.conf
> > primary_slot_name = 'postgresql1'
> > standby_mode = 'on'
> > recovery_target_timeline = 'latest'
> > [root@psql02 postgres-backup]#
> >
> > [root@psql03 postgresql-patroni-backup]# cat recovery.conf
> > primary_slot_name = 'postgresql2'
> > standby_mode = 'on'
> > recovery_target_timeline = 'latest'
> > [root@psql03 postgresql-patroni-backup]#
> >
> > I've made a copy of the root postgres directory over to another location
> > so when troubleshooting, I can always revert to the first state the
> > cluster was in when it failed.
>
> I have no experience with Patroni so I will be of no help there. You
> might get more useful information from:
>
> https://github.com/zalando/patroni
> Community
>
> There are two places to connect with the Patroni community: on github,
> via Issues and PRs, and on channel #patroni in the PostgreSQL Slack. If
> you're using Patroni, or just interested, please join us.
>

Will post there as well.  Thank you.  My thinking was to post here first
since I suspect the Patroni community will simply refer me back here given
that the PostgreSQL errors are originating directly from PostgreSQL.


>
> That being said, can you start the copied Postgres instance without
> using the Patroni instrumentation?
>

Yes, that is something I have been trying to do actually.  But I hit a dead
end with the three errors above.

So what I did is to copy a single node's backed up copy of the data files
to */data/patroni* of the same node ( this is the psql data directory as
defined through patroni ) of the same node then ran this ( psql03 =
192.168.0.118 ):

# sudo su - postgres
$ /usr/pgsql-10/bin/postgres -D /data/patroni
--config-file=/data/patroni/postgresql.conf
--listen_addresses=192.168.0.118 --max_worker_processes=8
--max_locks_per_transaction=64 --wal_level=replica
--track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
--max_replication_slots=10 --max_connections=100 --hot_standby=on
--cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5

This resulted in one of the 3 messages above.  Hence the post here.  If I
can start a single instance, I should be fine since I could then 1)
replicate over to the other two or 2) simply take a dump, reinitialize all
the databases then restore the dump.

Using the above procedure I get one of three error messages when using the
data files of each node:

[ PSQL01 ]
postgres: postgres: startup process waiting for 000000010000000000000008

[ PSQL02 ]
PANIC:replicationcheckpointhas wrong magic 0 instead of  307747550

[ PSQL03 }
FATAL:syntax error inhistory file:f2W

And I can't start any one of them.


>
> >
> > Thx,
> > TK
> >
>
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>

Reply via email to