One thing I keep coming back to is a bad RAM chip setting a bit in the block number. But I just can't seem to get it to add up. The difference is not a power of two, it has happened on two different machines, and we don't see other weirdness on either machine. It seems like a strange coincidence that it would happen to the same variable twice and not to other variables.
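
For what it's worth, here is a rough sanity check of the single-bit theory, using the two block numbers from the quoted message below (3634978 in the WAL record, 7141472 where the page apparently ended up). The helper names are mine and purely illustrative. It only shows whether the two values differ by one bit or by a power of two; it says nothing about how the bad value got there.

```python
# Block numbers taken from the quoted WAL dump and Tom's summary below.
EXPECTED_BLOCK = 3634978   # blk in the bkpblock[1] record
ACTUAL_BLOCK = 7141472     # block the page appears to have been written to


def is_power_of_two(n: int) -> bool:
    """Return True if n is positive and has exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def differing_bits(a: int, b: int) -> list[int]:
    """Return the bit positions (0 = least significant) where a and b differ."""
    x = a ^ b
    return [i for i in range(x.bit_length()) if (x >> i) & 1]


diff = ACTUAL_BLOCK - EXPECTED_BLOCK
bits = differing_bits(EXPECTED_BLOCK, ACTUAL_BLOCK)

print(f"expected = {EXPECTED_BLOCK:#x}, actual = {ACTUAL_BLOCK:#x}")
print(f"arithmetic difference = {diff}, power of two: {is_power_of_two(diff)}")
print(f"bit positions that differ: {bits} ({len(bits)} bits)")
```

A single stuck or flipped bit would show up as exactly one differing position, and the arithmetic difference would be a power of two. Neither holds for these two values, which is why the bad-RAM theory doesn't add up for me.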
Unless there's some unrelated code writing through a wild pointer, possibly to a stack allocated object that just happens to often be that variable?

-- 
greg

On 31 Jan 2014 20:21, "Tom Lane" <t...@sss.pgh.pa.us> wrote:

> Greg Stark <st...@mit.edu> writes:
> > So just to summarize, this xlog record:
> >
> > [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> > info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid
> > 3634978/282
> > [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> > info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
> > blk:3634978 hole_off/len:1240/2072
>
> > Appears to have been written to [ block 7141472 ]
>
> I've been staring at the code for a bit trying to guess how that could
> have happened. Since the WAL record has a backup block, btree_xlog_insert
> would have passed control to RestoreBackupBlock, which would call
> XLogReadBufferExtended with mode RBM_ZERO, so there would be no complaint
> about writing past the end of the relation. Now, you can imagine some
> very low-level error causing a write to go to the wrong page due to a seek
> problem or some such, but it's hard to credit that that would've resulted
> in creation of all the intervening segment files. Some level of our code
> had to have thought it was being told to extend the relation.
>
> However, on closer inspection I was a bit surprised to realize that there
> are two possible candidates for doing that! XLogReadBufferExtended will
> extend the relation, a block at a time, if told to write a page past
> the current nominal EOF. And in md.c, _mdfd_getseg will *also* extend
> the relation if we're InRecovery, even though it normally would not do
> so when called from mdwrite().
>
> Given the behavior in XLogReadBufferExtended, I rather think that the
> InRecovery special case in _mdfd_getseg is dead code and should be
> removed. But for the purpose at hand, it's more interesting to try to
> confirm which of these code levels did the extension. I notice that
> _mdfd_getseg only bothers to write the last physical page of each segment,
> whereas XLogReadBufferExtended knows nothing of segments and will
> ploddingly write every page. So on a filesystem that supports "holes"
> in files, I'd expect that the added segments would be fully allocated
> if XLogReadBufferExtended did the deed, but they'd be quite small if
> _mdfd_getseg did so. The du results you started with suggest that the
> former is the case, but could you verify that the filesystem this is
> on supports holes and that du will report only the actually allocated
> space when there's a hole?
>
> Assuming that the extension was done in XLogReadBufferExtended, we are
> forced to the conclusion that XLogReadBufferExtended was passed a bad
> block number (viz 7141472); and it's pretty hard to see how that could
> happen. RestoreBackupBlock is just passing the value it got out of the
> WAL record. I thought about the idea that it was wrong about exactly
> where the BkpBlock struct was in the record, but that would presumably
> lead to garbage relnode and fork numbers not just a bad block number.
>
> So I'm still baffled ...
>
> regards, tom lane
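
The hole check Tom asks for could be sketched roughly like this. It is only a stand-in, not anything from the PostgreSQL source: it writes one small file with just its last page (the pattern Tom attributes to _mdfd_getseg) and one with every page (the pattern he attributes to XLogReadBufferExtended), then compares apparent size with allocated size. The function names, the 64-page file size and the filler byte are my assumptions for the demo, and it relies on st_blocks, so it is Unix-only.

```python
import os
import tempfile

PAGE_SIZE = 8192               # PostgreSQL's default block size
N_PAGES = 64                   # tiny stand-in for a segment (assumption for the demo)
FILLER = b"\xaa" * PAGE_SIZE   # non-zero so the data cannot be elided as zeros


def write_last_page_only(path: str) -> None:
    """Write only the final page; everything before it is left as a hole."""
    with open(path, "wb") as f:
        f.seek((N_PAGES - 1) * PAGE_SIZE)
        f.write(FILLER)
        f.flush()
        os.fsync(f.fileno())   # force allocation so st_blocks is meaningful


def write_every_page(path: str) -> None:
    """Write every page in order, so no holes are possible."""
    with open(path, "wb") as f:
        for _ in range(N_PAGES):
            f.write(FILLER)
        f.flush()
        os.fsync(f.fileno())


def sizes(path: str) -> tuple[int, int]:
    """Return (apparent size, allocated bytes); st_blocks counts 512-byte units."""
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


def compare_allocation(directory: str) -> dict[str, tuple[int, int]]:
    """Create both files in `directory` and report their sizes."""
    sparse = os.path.join(directory, "last_page_only")
    dense = os.path.join(directory, "every_page")
    write_last_page_only(sparse)
    write_every_page(dense)
    return {"sparse": sizes(sparse), "dense": sizes(dense)}


# Point `dir=` at the filesystem holding the data directory; the default
# temp location may be a different filesystem altogether.
with tempfile.TemporaryDirectory() as tmp:
    for name, (apparent, allocated) in compare_allocation(tmp).items():
        print(f"{name}: apparent={apparent} bytes, allocated={allocated} bytes")
```

If the filesystem supports holes, the last-page-only file should show far less allocated space than its apparent size, while the every-page file should not. With GNU coreutils, comparing plain `du` against `du --apparent-size` on the suspect segment files gives the same kind of information.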