Ouch! That is wrong. It should decrement segno, not keep.
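To make that concrete: assuming the second patch touches KeepLogSeg() in xlog.c, where "keep" and "segno" are the local variables in question, a minimal sketch of the correction I mean is below. The placement of the adjustment is only illustrative, not the patch itself.

/*
 * Minimal sketch, as it would sit in src/backend/access/transam/xlog.c.
 * Retain one extra segment for a cross-segment continuation record by
 * stepping the computed segment number back, instead of adjusting the
 * slot's "keep" LSN.
 */
static void
KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
{
    XLogSegNo   segno;
    XLogRecPtr  keep;

    XLByteToSeg(recptr, segno);
    keep = XLogGetReplicationSlotMinimumLSN();

    /* compute limit for wal_keep_segments first */
    if (wal_keep_segments > 0)
    {
        /* avoid going below 1 */
        if (segno <= wal_keep_segments)
            segno = 1;
        else
            segno = segno - wal_keep_segments;
    }

    /* then check whether the replication slots limit removal further */
    if (max_replication_slots > 0 && keep != InvalidXLogRecPtr)
    {
        XLogSegNo   slotSegNo;

        XLByteToSeg(keep, slotSegNo);

        if (slotSegNo <= 0)
            segno = 1;
        else if (slotSegNo < segno)
            segno = slotSegNo;
    }

    /*
     * Decrement segno, not keep: hold back one spare segment so that the
     * first half of a record continued into the next segment stays
     * available to the standby.
     */
    if (segno > 1)
        segno--;

    /* don't delete WAL segments newer than the calculated segment */
    if (segno < *logSegNo)
        *logSegNo = segno;
}

That keeps the segment holding the first half of a continuation record without moving the slot's restart position itself.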
On Tue, Jan 17, 2017 at 19:37, Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> Hello. I added pgsql-hackers.
>
> This occurs also on git master and back to 9.4.
>
> At Fri, 13 Jan 2017 08:47:06 -0600, Jonathon Nelson <jdnel...@dyn.com> wrote in
> <cacjqam1ydczcd5docp+y5hkwto1zegw+mj8uk7avqctbgjo...@mail.gmail.com>
> > On Mon, Nov 28, 2016 at 1:39 PM, Jonathon Nelson <jdnel...@dyn.com> wrote:
> > > First, the postgresql configuration differs only minimally from the
> > > stock config:
> > >
> > > Assume wal_keep_segments = 0.
> > > Assume the use of physical replication slots.
> > > Assume one master, one standby.
> > >
> > > Lastly, we have observed the behavior "in the wild" at least twice and
> > > in the lab a dozen or so times.
> > >
> > > EXAMPLE #1 (logs are from the replica):
> > >
> > > user=,db=,app=,client= DEBUG: creating and filling new WAL file
> > > user=,db=,app=,client= DEBUG: done creating and filling new WAL file
> > > user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush 6/8A000000 apply 5/748425A0
> > > user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush 6/8B000000 apply 5/74843020
> > > <control-c here>
> > > user=,db=,app=,client= DEBUG: postmaster received signal 2
> > > user=,db=,app=,client= LOG: received fast shutdown request
> > > user=,db=,app=,client= LOG: aborting any active transactions
> > >
> > > And, upon restart:
> > >
> > > user=,db=,app=,client= LOG: restartpoint starting: xlog
> > > user=,db=,app=,client= DEBUG: updated min recovery point to 6/67002390 on timeline 1
> > > user=,db=,app=,client= DEBUG: performing replication slot checkpoint
> > > user=,db=,app=,client= DEBUG: updated min recovery point to 6/671768C0 on timeline 1
> > > user=,db=,app=,client= CONTEXT: writing block 589 of relation base/13294/16501
> > > user=,db=,app=,client= LOG: invalid magic number 0000 in log segment 00000001000000060000008B, offset 0
> > > user=,db=,app=,client= DEBUG: switched WAL source from archive to stream after failure
> > > user=,db=,app=,client= LOG: started streaming WAL from primary at 6/8A000000 on timeline 1
> > > user=,db=,app=,client= FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000060000008A has already been removed
>
> I managed to reproduce this. A small tweak (the first patch) makes the
> standby kill itself as soon as walreceiver sees a contrecord at the
> beginning of a segment.
>
> - M(aster): createdb as a master with wal_keep_segments = 0 (default),
>   log_min_messages = debug2
> - M: Create a physical repslot.
> - S(tandby): Set up a standby database.
> - S: Edit recovery.conf to use the replication slot above, then start it.
> - S: touch /tmp/hoge
> - M: Run pgbench ...
> - S: After a while, the standby stops.
>   LOG: #################### STOP THE SERVER
> - M: Stop pgbench.
> - M: Do 'checkpoint;' twice.
> - S: rm /tmp/hoge
> - S: It fails to catch up, with the following error:
>   FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000000000002B has already been removed
>
> This problem occurs when only the earlier part of a continued record has
> been replicated before the segment is removed on the master. In other
> words, the first half is only on the standby, and the second half is
> only on the master.
>
> I believe that a continuation record cannot span three or more
> *segments* (is that right?), so keeping one spare segment would be
> enough.
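(As an aside on how that condition shows up on disk: a contrecord at a segment boundary is marked in the WAL page header, so it can be recognized directly. The sketch below is only an illustration, not the first patch; the function name and arguments are made up for the example.)

#include "access/xlog_internal.h"

/*
 * Illustrative helper: does the segment beginning at startptr open with
 * the continuation of a record that started in the previous segment?
 * "page" is assumed to point at the first page of that segment.
 */
static bool
segment_starts_with_contrecord(const char *page, XLogRecPtr startptr)
{
    const XLogPageHeaderData *hdr = (const XLogPageHeaderData *) page;

    /* Only meaningful when startptr sits exactly on a segment boundary. */
    if (startptr % XLogSegSize != 0)
        return false;

    /*
     * XLP_FIRST_IS_CONTRECORD marks a page whose first record continues
     * one that began earlier, i.e. in the previous segment, which must
     * therefore remain readable from some source.
     */
    return (hdr->xlp_info & XLP_FIRST_IS_CONTRECORD) != 0;
}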
> The attached second patch does this.
>
> Other possible measures might be:
>
> - Allowing the WAL source to be switched while reading a continuation
>   record. Currently ReadRecord assumes that a continuation record can be
>   read from a single source, so this needs refactoring involving xlog.c,
>   xlogreader.c and relatives.
>
> - Delaying recycling of a segment until the last partial record on it
>   completes. This seems doable page-wise (coarse resolution) but would
>   cost additional reading of past xlog files (the page headers of past
>   pages are required).
>
> - Delaying write/flush feedback until the current record is completed.
>   walreceiver is not conscious of WAL records, and this might break
>   synchronous replication.
>
> Any thoughts?
>
> > > A physical analysis shows that the WAL file 00000001000000060000008B
> > > is 100% zeroes (ASCII NUL).
>
> I suppose it is on the standby, so the segment file is the one where the
> next transferred record will be written.
>
> > > The results of querying pg_replication_slots show a restart_lsn that
> > > matches ….6/8B.
>
> It is the beginning of the next record to be replicated, as documented.
> In other words, just after the last transferred record (including
> padding).
>
> > > Pg_controldata shows values like:
> > > Minimum recovery ending location: 6/8Axxxxxx
>
> It is the beginning of the last applied record.
>
> > > How can the master show a position that is greater than the minimum
> > > recovery ending location?
>
> So it is natural that the former is larger than the latter.
>
> > > EXAMPLE #2:
> > >
> > > Minimum recovery ending location: 19DD/73FFFFE0
> > > Log segment 00000001000019DD00000073 was not available.
> > > The restart LSN was 19DD/74000000.
> > > The last few lines from pg_xlogdump 00000001000019DD00000073:
> > >
> > > rmgr: Btree len (rec/tot): 2/ 64, tx: 77257, lsn: 19DD/73FFFF60, prev 19DD/73FFFF20, desc: INSERT_LEAF off 132, blkref #0: rel 1663/16403/150017028 blk 1832
> > > rmgr: Btree len (rec/tot): 2/ 64, tx: 77257, lsn: 19DD/73FFFFA0, prev 19DD/73FFFF60, desc: INSERT_LEAF off 206, blkref #0: rel 1663/16403/150017028 blk 11709
> > >
> > > If I'm understanding this properly, (0x73FFFFA0 - 0x73000000) is the
> > > first byte of the last record in this file, and the record length is
> > > 64 bytes, which places the first byte of the next record at 16777184
> > > (0xffffe0) (logical position 0x73ffffe0; this jives with
> > > pg_controldata).
>
> Maybe right. pg_xlogdump skips partial records.
>
> > > However, there are only 32 bytes of file left:
> > > 0x73FFFFA0 - 0x73000000 + 64 -=> 16777184
> > > 16777216 - 16777184 -=> 32
> > >
> > > Which means that the next record is in the WAL file
> > > 00000001000019DD00000074.
>
> Maybe right.
>
> > > A possible theory:
> > >
> > > Let us assume PG has applied 100% of the data in a given WAL file, and
> > > let's assume (as in this case) that the WAL file is
> > > 00000001000019DD00000073. When it starts up again, it uses the control
> > > data to start and says "The next record is at 19DD/0x73ffffe0", which
> > > it truncates to 0x73000000. However, PG has *also* already told the
> > > master that it has fully received, written, and flushed all of the
> > > data for that WAL file, so the master has 0x74000000 as the start
> > > position (and has consequently removed the WAL file for 0x73).
> > > The relationship between pg_controldata and
> > > pg_replication_slots.restart_lsn seems to be very slightly (but
> > > importantly) at odds.
> > >
> > > Could it be this part of the code?
>
> No, the code does the right thing. The problem is that a continuation
> record is assumed to come from the same WAL source, that is, archive/WAL
> or streaming; but here the continuation record is distributed across two
> sources.
>
> > > From src/backend/replication/walreceiverfuncs.c in RequestXLogStreaming
> > > (as of a0aa358ca603d8189fe4be72f614cf7cf363d81a):
> > >
> > > 235     /*
> > > 236      * We always start at the beginning of the segment. That prevents a broken
> > > 237      * segment (i.e., with no records in the first half of a segment) from
> > > 238      * being created by XLOG streaming, which might cause trouble later on if
> > > 239      * the segment is e.g archived.
> > > 240      */
> > > 241     if (recptr % XLogSegSize != 0)
> > > 242         recptr -= recptr % XLogSegSize;
> > > 243
> > >
> > > We start up with 19DD/0x73ffffe0 (but there would not be enough room
> > > in that segment for any more records, so logically we'd have to go to
> > > 19DD/0x74000000). When we start WAL receiving, we truncate 0x73ffffe0
> > > to 0x73000000, which the master has already removed (and -
> > > technically - we don't actually need?).
>
> 0x73ffffe0 and 0x73000000 are in the same segment. The current recovery
> mechanism requires reading the record that starts at 0x73ffffe0, and
> since it is on the standby, it is read from there.
>
> regards,
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
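PS: to replay the arithmetic of EXAMPLE #2 outside the server, here is a small standalone C sketch (not PostgreSQL code; the 16MB segment size and the LSNs are taken from the report). It shows how the start-of-segment truncation in RequestXLogStreaming() maps the standby's restart point onto a segment the master is already free to recycle.

#include <stdint.h>
#include <stdio.h>

#define WAL_SEG_SIZE ((uint64_t) 16 * 1024 * 1024)  /* default 16MB segments */

int
main(void)
{
    /* 19DD/73FFFFE0: minimum recovery ending location on the standby */
    uint64_t min_recovery = (0x19DDULL << 32) | 0x73FFFFE0;
    /* 19DD/74000000: restart LSN reported for the slot on the master */
    uint64_t restart_lsn  = (0x19DDULL << 32) | 0x74000000;

    /* RequestXLogStreaming() rounds the receive start down to a segment boundary */
    uint64_t recv_start = min_recovery - (min_recovery % WAL_SEG_SIZE);

    printf("standby asks to stream from %lX/%08lX\n",
           (unsigned long) (recv_start >> 32),
           (unsigned long) (recv_start & 0xFFFFFFFF));
    printf("master keeps WAL only from %lX/%08lX\n",
           (unsigned long) (restart_lsn >> 32),
           (unsigned long) (restart_lsn & 0xFFFFFFFF));

    /*
     * Prints 19DD/73000000 vs 19DD/74000000: the requested segment
     * ...0073 is already eligible for removal on the master.
     */
    return 0;
}

If the theory above is right, the standby ends up requesting segment ...0073 from the master even though the piece it is actually missing, the continuation of the record at 19DD/73FFFFE0, lives in segment ...0074, which the master still has.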