Ouch! That is wrong. It should decrement segno, not keep.
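To make that concrete: assuming the second patch touches KeepLogSeg() in xlog.c, where "keep" and "segno" are the local variables in question, a minimal sketch of the correction I mean is below. The placement of the adjustment is only illustrative, not the patch itself.

/*
 * Minimal sketch, as it would sit in src/backend/access/transam/xlog.c.
 * Retain one extra segment for a cross-segment continuation record by
 * stepping the computed segment number back, instead of adjusting the
 * slot's "keep" LSN.
 */
static void
KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
{
    XLogSegNo   segno;
    XLogRecPtr  keep;

    XLByteToSeg(recptr, segno);
    keep = XLogGetReplicationSlotMinimumLSN();

    /* compute limit for wal_keep_segments first */
    if (wal_keep_segments > 0)
    {
        /* avoid going below 1 */
        if (segno <= wal_keep_segments)
            segno = 1;
        else
            segno = segno - wal_keep_segments;
    }

    /* then check whether the replication slots limit removal further */
    if (max_replication_slots > 0 && keep != InvalidXLogRecPtr)
    {
        XLogSegNo   slotSegNo;

        XLByteToSeg(keep, slotSegNo);

        if (slotSegNo <= 0)
            segno = 1;
        else if (slotSegNo < segno)
            segno = slotSegNo;
    }

    /*
     * Decrement segno, not keep: hold back one spare segment so that the
     * first half of a record continued into the next segment stays
     * available to the standby.
     */
    if (segno > 1)
        segno--;

    /* don't delete WAL segments newer than the calculated segment */
    if (segno < *logSegNo)
        *logSegNo = segno;
}

That keeps the segment holding the first half of a continuation record without moving the slot's restart position itself.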
On Tue, Jan 17, 2017 at 19:37, Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> Hello. I added pgsql-hackers.
>
> This occurs also on git master and back to 9.4.
>
> At Fri, 13 Jan 2017 08:47:06 -0600, Jonathon Nelson <jdnel...@dyn.com> wrote in
> <cacjqam1ydczcd5docp+y5hkwto1zegw+mj8uk7avqctbgjo...@mail.gmail.com>
> > On Mon, Nov 28, 2016 at 1:39 PM, Jonathon Nelson <jdnel...@dyn.com> wrote:
> > > First, the postgresql configuration differs only minimally from the
> > > stock config:
> > >
> > > Assume wal_keep_segments = 0.
> > > Assume the use of physical replication slots.
> > > Assume one master, one standby.
> > >
> > > Lastly, we have observed the behavior "in the wild" at least twice and
> > > in the lab a dozen or so times.
> > >
> > > EXAMPLE #1 (logs are from the replica):
> > >
> > > user=,db=,app=,client= DEBUG: creating and filling new WAL file
> > > user=,db=,app=,client= DEBUG: done creating and filling new WAL file
> > > user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush 6/8A000000 apply 5/748425A0
> > > user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush 6/8B000000 apply 5/74843020
> > > <control-c here>
> > > user=,db=,app=,client= DEBUG: postmaster received signal 2
> > > user=,db=,app=,client= LOG: received fast shutdown request
> > > user=,db=,app=,client= LOG: aborting any active transactions
> > >
> > > And, upon restart:
> > >
> > > user=,db=,app=,client= LOG: restartpoint starting: xlog
> > > user=,db=,app=,client= DEBUG: updated min recovery point to 6/67002390 on timeline 1
> > > user=,db=,app=,client= DEBUG: performing replication slot checkpoint
> > > user=,db=,app=,client= DEBUG: updated min recovery point to 6/671768C0 on timeline 1
> > > user=,db=,app=,client= CONTEXT: writing block 589 of relation base/13294/16501
> > > user=,db=,app=,client= LOG: invalid magic number 0000 in log segment 00000001000000060000008B, offset 0
> > > user=,db=,app=,client= DEBUG: switched WAL source from archive to stream after failure
> > > user=,db=,app=,client= LOG: started streaming WAL from primary at 6/8A000000 on timeline 1
> > > user=,db=,app=,client= FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000060000008A has already been removed
>
> I managed to reproduce this. A small tweak (the first patch) makes the
> standby kill itself as soon as walreceiver sees a contrecord at the
> beginning of a segment.
>
> - M(aster): createdb as a master with wal_keep_segments = 0 (default),
>   log_min_messages = debug2
> - M: Create a physical repslot.
> - S(tandby): Set up a standby database.
> - S: Edit recovery.conf to use the replication slot above, then start it.
> - S: touch /tmp/hoge
> - M: Run pgbench ...
> - S: After a while, the standby stops.
>   LOG: #################### STOP THE SERVER
> - M: Stop pgbench.
> - M: Do 'checkpoint;' twice.
> - S: rm /tmp/hoge
> - S: It fails to catch up, with the following error:
>   FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000000000002B has already been removed
>
> This problem occurs when only the earlier part of a continued record has
> been replicated before the segment is removed on the master. In other
> words, the first half is only on the standby, and the second half is
> only on the master.
>
> I believe that a continuation record cannot span three or more
> *segments* (is that right?), so keeping one spare segment would be
> enough.
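(As an aside on how that condition shows up on disk: a contrecord at a segment boundary is marked in the WAL page header, so it can be recognized directly. The sketch below is only an illustration, not the first patch; the function name and arguments are made up for the example.)

#include "access/xlog_internal.h"

/*
 * Illustrative helper: does the segment beginning at startptr open with
 * the continuation of a record that started in the previous segment?
 * "page" is assumed to point at the first page of that segment.
 */
static bool
segment_starts_with_contrecord(const char *page, XLogRecPtr startptr)
{
    const XLogPageHeaderData *hdr = (const XLogPageHeaderData *) page;

    /* Only meaningful when startptr sits exactly on a segment boundary. */
    if (startptr % XLogSegSize != 0)
        return false;

    /*
     * XLP_FIRST_IS_CONTRECORD marks a page whose first record continues
     * one that began earlier, i.e. in the previous segment, which must
     * therefore remain readable from some source.
     */
    return (hdr->xlp_info & XLP_FIRST_IS_CONTRECORD) != 0;
}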
> The attached second patch does this.
>
> Other possible measures might be:
>
> - Allowing the WAL source to be switched while reading a continuation
>   record. Currently ReadRecord assumes that a continuation record can be
>   read from a single source, so this needs refactoring involving xlog.c,
>   xlogreader.c and relatives.
>
> - Delaying recycling of a segment until the last partial record on it
>   completes. This seems doable page-wise (coarse resolution) but would
>   cost additional reading of past xlog files (the page headers of past
>   pages are required).
>
> - Delaying write/flush feedback until the current record is completed.
>   walreceiver is not conscious of WAL records, and this might break
>   synchronous replication.
>
> Any thoughts?
>
> > > A physical analysis shows that the WAL file 00000001000000060000008B
> > > is 100% zeroes (ASCII NUL).
>
> I suppose it is on the standby, so the segment file is the one where the
> next transferred record will be written.
>
> > > The results of querying pg_replication_slots show a restart_lsn that
> > > matches ….6/8B.
>
> It is the beginning of the next record to be replicated, as documented.
> In other words, just after the last transferred record (including
> padding).
>
> > > Pg_controldata shows values like:
> > > Minimum recovery ending location: 6/8Axxxxxx
>
> It is the beginning of the last applied record.
>
> > > How can the master show a position that is greater than the minimum
> > > recovery ending location?
>
> So it is natural that the former is larger than the latter.
>
> > > EXAMPLE #2:
> > >
> > > Minimum recovery ending location: 19DD/73FFFFE0
> > > Log segment 00000001000019DD00000073 was not available.
> > > The restart LSN was 19DD/74000000.
> > > The last few lines from pg_xlogdump 00000001000019DD00000073:
> > >
> > > rmgr: Btree len (rec/tot): 2/ 64, tx: 77257, lsn: 19DD/73FFFF60, prev 19DD/73FFFF20, desc: INSERT_LEAF off 132, blkref #0: rel 1663/16403/150017028 blk 1832
> > > rmgr: Btree len (rec/tot): 2/ 64, tx: 77257, lsn: 19DD/73FFFFA0, prev 19DD/73FFFF60, desc: INSERT_LEAF off 206, blkref #0: rel 1663/16403/150017028 blk 11709
> > >
> > > If I'm understanding this properly, (0x73FFFFA0 - 0x73000000) is the
> > > first byte of the last record in this file, and the record length is
> > > 64 bytes, which places the first byte of the next record at 16777184
> > > (0xffffe0) (logical position 0x73ffffe0; this jives with
> > > pg_controldata).
>
> Maybe right. pg_xlogdump skips partial records.
>
> > > However, there are only 32 bytes of file left:
> > > 0x73FFFFA0 - 0x73000000 + 64 -=> 16777184
> > > 16777216 - 16777184 -=> 32
> > >
> > > Which means that the next record is in the WAL file
> > > 00000001000019DD00000074.
>
> Maybe right.
>
> > > A possible theory:
> > >
> > > Let us assume PG has applied 100% of the data in a given WAL file, and
> > > let's assume (as in this case) that the WAL file is
> > > 00000001000019DD00000073. When it starts up again, it uses the control
> > > data to start and says "The next record is at 19DD/0x73ffffe0", which
> > > it truncates to 0x73000000. However, PG has *also* already told the
> > > master that it has fully received, written, and flushed all of the
> > > data for that WAL file, so the master has 0x74000000 as the start
> > > position (and has consequently removed the WAL file for 0x73).
> > > The relationship between pg_controldata and
> > > pg_replication_slots.restart_lsn seems to be very slightly (but
> > > importantly) at odds.
> > >
> > > Could it be this part of the code?
>
> No, the code does the right thing. The problem is that a continuation
> record is assumed to come from the same WAL source, that is, archive/WAL
> or streaming; but here the continuation record is distributed across two
> sources.
>
> > > From src/backend/replication/walreceiverfuncs.c in RequestXLogStreaming
> > > (as of a0aa358ca603d8189fe4be72f614cf7cf363d81a):
> > >
> > > 235     /*
> > > 236      * We always start at the beginning of the segment. That prevents a broken
> > > 237      * segment (i.e., with no records in the first half of a segment) from
> > > 238      * being created by XLOG streaming, which might cause trouble later on if
> > > 239      * the segment is e.g archived.
> > > 240      */
> > > 241     if (recptr % XLogSegSize != 0)
> > > 242         recptr -= recptr % XLogSegSize;
> > > 243
> > >
> > > We start up with 19DD/0x73ffffe0 (but there would not be enough room
> > > in that segment for any more records, so logically we'd have to go to
> > > 19DD/0x74000000). When we start WAL receiving, we truncate 0x73ffffe0
> > > to 0x73000000, which the master has already removed (and -
> > > technically - we don't actually need?).
>
> 0x73ffffe0 and 0x73000000 are in the same segment. The current recovery
> mechanism requires reading the record that starts at 0x73ffffe0, and
> since it is on the standby, it is read from there.
>
> regards,
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
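PS: to replay the arithmetic of EXAMPLE #2 outside the server, here is a small standalone C sketch (not PostgreSQL code; the 16MB segment size and the LSNs are taken from the report). It shows how the start-of-segment truncation in RequestXLogStreaming() maps the standby's restart point onto a segment the master is already free to recycle.

#include <stdint.h>
#include <stdio.h>

#define WAL_SEG_SIZE ((uint64_t) 16 * 1024 * 1024)  /* default 16MB segments */

int
main(void)
{
    /* 19DD/73FFFFE0: minimum recovery ending location on the standby */
    uint64_t min_recovery = (0x19DDULL << 32) | 0x73FFFFE0;
    /* 19DD/74000000: restart LSN reported for the slot on the master */
    uint64_t restart_lsn  = (0x19DDULL << 32) | 0x74000000;

    /* RequestXLogStreaming() rounds the receive start down to a segment boundary */
    uint64_t recv_start = min_recovery - (min_recovery % WAL_SEG_SIZE);

    printf("standby asks to stream from %lX/%08lX\n",
           (unsigned long) (recv_start >> 32),
           (unsigned long) (recv_start & 0xFFFFFFFF));
    printf("master keeps WAL only from %lX/%08lX\n",
           (unsigned long) (restart_lsn >> 32),
           (unsigned long) (restart_lsn & 0xFFFFFFFF));

    /*
     * Prints 19DD/73000000 vs 19DD/74000000: the requested segment
     * ...0073 is already eligible for removal on the master.
     */
    return 0;
}

If the theory above is right, the standby ends up requesting segment ...0073 from the master even though the piece it is actually missing, the continuation of the record at 19DD/73FFFFE0, lives in segment ...0074, which the master still has.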