On Sat, Aug 30, 2014 at 7:04 AM, Chris Hundt <ch...@judicata.com> wrote:
> (One of Patrick's coworkers here; thanks a lot for your assistance)
>
> Just in case you wanted this as well, I ran
>
> psql 'replication=1 dbname=XXX host=127.0.0.1 port=5432 user=XXX
> password=XXX' -c 'IDENTIFY_SYSTEM;'
>
> (5432 is the stuck replica and 5445 is the pipe to the working replica)
>
>       systemid       | timeline |   xlogpos
> ---------------------+----------+--------------
>  5964163898407843711 |        1 | 174/B76D16A8
>
>
> On Fri, Aug 29, 2014 at 7:26 PM, Patrick Krecker <patr...@judicata.com>
> wrote:
>>
>> On Fri, Aug 29, 2014 at 3:46 PM, Andres Freund <and...@2ndquadrant.com>
>> wrote:
>>>
>>> [FWIW: proper quoting makes answering easier and thus more likely]
>>>
>>> On 2014-08-29 15:37:51 -0700, Patrick Krecker wrote:
>>> > I ran the following on the local endpoint of spiped:
>>> >
>>> > while [ true ]; do psql -h localhost -p 5445 judicata -U marbury -c "select
>>> > modtime, pg_last_xlog_receive_location(), pg_last_xlog_replay_location()
>>> > from replication_time"; done;
>>> >
>>> > And the same command on production, and I was able to verify that the
>>> > xlogs for a given point in time were the same (modtime is updated every
>>> > second by an upstart job):
>>> >
>>> > spiped from office -> production:
>>> >           modtime           | pg_last_xlog_receive_location | pg_last_xlog_replay_location
>>> > ----------------------------+-------------------------------+------------------------------
>>> >  2014-08-29 15:23:25.563766 | 177/2E80C9F8                  | 177/2E80C9F8
>>> >
>>> > Ran directly on production replica:
>>> >           modtime           | pg_last_xlog_receive_location | pg_last_xlog_replay_location
>>> > ----------------------------+-------------------------------+------------------------------
>>> >  2014-08-29 15:23:25.563766 | 177/2E80C9F8                  | 177/2E80C9F8
>>> >
>>> > To me, this is sufficient proof that spiped is indeed talking to the
>>> > machine I think it's talking to (also lsof reports the correct hostname).
>>> >
>>> > I created another basebackup from the currently stuck postgres instance
>>> > on another machine and I also get this error:
>>> >
>>> > 2014-08-29 15:27:30 PDT FATAL: could not receive data from WAL stream:
>>> > ERROR: requested starting point 177/2D000000 is ahead of the WAL flush
>>> > position of this server 174/B76D16A8
>>>
>>> Uh. This indicates that the machine you're talking to is *not* one of
>>> the above, as it has a flush position of '174/B76D16A8' - not something
>>> that's really possible when the node actually is at '177/2E80C9F8'.
>>>
>>> Could you run, on the standby that's having problems, the following
>>> command:
>>> psql 'host=127.0.0.1 port=5445 user=XXX password=XXX' -c 'IDENTIFY_SYSTEM;'
>>>
>>> Greetings,
>>>
>>> Andres Freund
>>>
>>> --
>>> Andres Freund                     http://www.2ndQuadrant.com/
>>> PostgreSQL Development, 24x7 Support, Training & Services
>>
>> RE: quoting, I wonder if Gmail is messing it up somehow? Or am I doing
>> something else wrong? Sorry :(
>>
>> First, I apologize for the misleading information: when I made another
>> basebackup and tried to use it, I had configured the machine to cascade
>> from the stuck replica, *not* from the spiped endpoint. When I properly
>> connected it to the spiped endpoint, it synced up fine, giving this log
>> line:
>>
>> 2014-08-29 16:16:21 PDT LOG: started streaming WAL from primary at
>> 177/4F000000 on timeline 1
>>
>> The command as you gave it reported a syntax error as is, but I googled
>> a little bit and ran this one:
>>
>> psql 'replication=1 dbname=XXX host=127.0.0.1 port=5445 user=XXX
>> password=XXX' -c 'IDENTIFY_SYSTEM;'
>>
>> And it gave me this output:
>>
>>       systemid       | timeline |   xlogpos
>> ---------------------+----------+--------------
>>  5964163898407843711 |        1 | 177/53091990

I restarted the server and it promptly started fetching WAL archives from
our backups [1].
I am still at a loss as to why the binary replication couldn't reconnect,
and why Postgres didn't fall back to fetching WAL archives with its
restore_command.

[1] http://pastebin.com/ZxM1LsNw
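
For anyone finding this thread later: the fallback in question is driven by
recovery.conf on the standby, where standby_mode, primary_conninfo, and
restore_command live side by side. A rough sketch for a cascading setup like
this one follows; the user, password, and archive path are placeholders, not
values from this thread:

    standby_mode = 'on'
    # stream from the local spiped endpoint that forwards to the working replica
    primary_conninfo = 'host=127.0.0.1 port=5445 user=XXX password=XXX'
    # placeholder path; consulted whenever streaming is unavailable
    restore_command = 'cp /path/to/wal_archive/%f "%p"'

With standby_mode on, the startup process is supposed to alternate between
running restore_command and attempting to stream, so a standby that does
neither after a disconnect deserves a close look in its logs.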