Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

Greg Stark Fri, 31 Jan 2014 12:29:23 -0800

One thing I keep coming back to is a bad RAM chip setting a bit in the
block number. But I just can't seem to get it to add up. The difference is
not a power of two, it has happened on two different machines, and we don't
see other weirdness on those machines. It seems like a strange coincidence it
would happen to the same variable twice and not to other variables.
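
For reference, the bit-flip check can be written down as a minimal sketch.
The two block numbers come from the record quoted below; the helper names
are made up for illustration and are not anything in the PostgreSQL tree.

```python
# Block number recorded in the WAL record vs. the block actually written.
WAL_BLOCK = 3634978
WRITTEN_BLOCK = 7141472


def is_power_of_two(n):
    """True if n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def single_bit_flip(a, b):
    """True if a and b differ in exactly one bit position."""
    return is_power_of_two(a ^ b)


if __name__ == "__main__":
    diff = WRITTEN_BLOCK - WAL_BLOCK
    print("difference:", diff, "power of two:", is_power_of_two(diff))
    print("xor:", hex(WAL_BLOCK ^ WRITTEN_BLOCK),
          "single bit flip:", single_bit_flip(WAL_BLOCK, WRITTEN_BLOCK))
```

Both tests come out false for these two values, which is why a single
stuck or flipped bit doesn't explain it.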


Unless there's some unrelated code writing through a wild pointer, possibly
to a stack allocated object that just happens to often be that variable?

-- 
greg
On 31 Jan 2014 20:21, "Tom Lane" <t...@sss.pgh.pa.us> wrote:

> Greg Stark <st...@mit.edu> writes:
> > So just to summarize, this xlog record:
> > [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> > info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid
> > 3634978/282
> > [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> > info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
> > blk:3634978 hole_off/len:1240/2072
>
> > Appears to have been written to [ block 7141472 ]
>
> I've been staring at the code for a bit trying to guess how that could
> have happened.  Since the WAL record has a backup block, btree_xlog_insert
> would have passed control to RestoreBackupBlock, which would call
> XLogReadBufferExtended with mode RBM_ZERO, so there would be no complaint
> about writing past the end of the relation.  Now, you can imagine some
> very low-level error causing a write to go to the wrong page due to a seek
> problem or some such, but it's hard to credit that that would've resulted
> in creation of all the intervening segment files.  Some level of our code
> had to have thought it was being told to extend the relation.
>
> However, on closer inspection I was a bit surprised to realize that there
> are two possible candidates for doing that!  XLogReadBufferExtended will
> extend the relation, a block at a time, if told to write a page past
> the current nominal EOF.  And in md.c, _mdfd_getseg will *also* extend
> the relation if we're InRecovery, even though it normally would not do
> so when called from mdwrite().
>
> Given the behavior in XLogReadBufferExtended, I rather think that the
> InRecovery special case in _mdfd_getseg is dead code and should be
> removed.  But for the purpose at hand, it's more interesting to try to
> confirm which of these code levels did the extension.  I notice that
> _mdfd_getseg only bothers to write the last physical page of each segment,
> whereas XLogReadBufferExtended knows nothing of segments and will
> ploddingly write every page.  So on a filesystem that supports "holes"
> in files, I'd expect that the added segments would be fully allocated
> if XLogReadBufferExtended did the deed, but they'd be quite small if
> _mdfd_getseg did so.  The du results you started with suggest that the
> former is the case, but could you verify that the filesystem this is
> on supports holes and that du will report only the actually allocated
> space when there's a hole?
>
> Assuming that the extension was done in XLogReadBufferExtended, we are
> forced to the conclusion that XLogReadBufferExtended was passed a bad
> block number (viz 7141472); and it's pretty hard to see how that could
> happen.  RestoreBackupBlock is just passing the value it got out of the
> WAL record.  I thought about the idea that it was wrong about exactly
> where the BkpBlock struct was in the record, but that would presumably
> lead to garbage relnode and fork numbers not just a bad block number.
>
> So I'm still baffled ...
>
>                         regards, tom lane
>
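
On the question of holes: one way to check is to compare each segment
file's apparent size with the space actually allocated to it. This is only
a rough sketch, assuming a POSIX system where st_blocks is reported in
512-byte units; the function names and the 0.5 threshold are arbitrary
choices for illustration.

```python
import os


def allocation_report(path):
    """Return (apparent_bytes, allocated_bytes) for one file."""
    st = os.stat(path)
    # On POSIX systems st_blocks counts 512-byte units, regardless of the
    # filesystem's own block size.
    return st.st_size, st.st_blocks * 512


def looks_sparse(path, threshold=0.5):
    """Guess whether a file has holes: allocated space well below its size."""
    apparent, allocated = allocation_report(path)
    return apparent > 0 and allocated < apparent * threshold
```

If the added segments were fully written page by page, allocated and
apparent sizes should be roughly equal. If only the last page of each
segment was written, and the filesystem supports holes, the allocated
size should be a tiny fraction of the apparent size.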
