On Sat, Feb 15, 2014 at 11:45 AM, Andres Freund wrote:
> I guess the theoretically correct thing would be to make all WAL records
> about truncation and unlinking contain the current size of the relation,
> but especially with deletions and forks that will probably turn out to
> be annoying to do.
Going over this I think this is still a potential issue:
On 31 Jan 2014 15:56, "Andres Freund" wrote:
>
> I am not sure that explains the issue, but I think the redo action for
> truncation is not safe across crashes. A XLOG_SMGR_TRUNCATE will just
> do a smgrtruncate() (and then mdtruncate)
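To make the failure mode concrete, here is a standalone C model of the
sequence being described (this is not PostgreSQL code; the file name and
block numbers are invented). Replaying a truncate and then, after a crash
during recovery, re-replaying a later record that touches a high block
quietly re-extends the file, with everything in between reading back as
NULs:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192

int main(void)
{
    int fd = open("relfilenode.sim", O_RDWR | O_CREAT, 0600);
    char page[BLCKSZ];
    memset(page, 'x', BLCKSZ);

    /* redo of XLOG_SMGR_TRUNCATE: cut the file back to 10 blocks */
    ftruncate(fd, (off_t) 10 * BLCKSZ);

    /* ... crash here; recovery restarts from an earlier point ... */

    /* re-replay of a record touching block 131072: the write beyond EOF
     * succeeds, and blocks 10..131071 now read back as NULs (a hole on
     * most filesystems) */
    pwrite(fd, page, BLCKSZ, (off_t) 131072 * BLCKSZ);

    printf("file is now %lld bytes\n", (long long) lseek(fd, 0, SEEK_END));
    close(fd);
    return 0;
}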
Greg Stark writes:
> On Thu, Feb 13, 2014 at 7:52 PM, Tom Lane wrote:
>> That's what's bothering me, too. On the other hand, if we can't think of
>> a scenario where it'd be necessary to replay the high-offset update, then
>> I'm disinclined to mess with the code further.
> And the whole point
On Thu, Feb 13, 2014 at 7:52 PM, Tom Lane wrote:
>> The scenario I could come up with that didn't require a broken base backup
>> was that there was an earlier truncate or vacuum. So the sequence is high
>> offset reference, truncate, growth, crash. All possibly on a single
>> database.
>
> That's what's bothering me, too.
> I think what you're arguing is that we should see WAL records filling the
> rest of segment 1 before we see any references to segment 2, but if that's
> the case then how did we get into the situation you reported? Or is it
> just that it was a broken base backup to start with?
The scenario I could come up with that didn't require a broken base backup
was that there was an earlier truncate or vacuum. So the sequence is high
offset reference, truncate, growth, crash. All possibly on a single
database.
Hi all,
On 02/12/2014 08:27 PM, Greg Stark wrote:
> On Wed, Feb 12, 2014 at 6:55 PM, Tom Lane wrote:
>> Greg Stark writes:
>>> For what it's worth I've confirmed the bug in wal-e caused the initial
>>> problem.
>> Huh? Bug in wal-e? What bug?
> WAL-E actually didn't restore a whole 1GB file due to a transient S3
> problem, in fact a bunch of them.
On Wed, Feb 12, 2014 at 8:28 PM, Tom Lane wrote:
> Oh, wait a minute. It's not just a matter of whether we find the right
> block: we also have to consider whether XLogReadBufferExtended will
> apply the right "mode" behavior. Currently, it supposes that all pages
> past the initially observed EOF
I wrote:
> What I think we probably want to do is forcibly cause the target page
> to exist, using a P_NEW loop like what I committed, and then decide
> on the basis of whether it's all-zeroes whether to consider it invalid
> or not. This seems sane on the grounds that it's just the extension
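The shape of that approach, as a standalone sketch (invented names, not
the committed code): extend the file one zeroed page at a time until the
target block exists, then judge the page by its content:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Extend the file page by page (like a P_NEW loop) until blkno exists,
 * then return whether the page there is all zeroes (=> treat as invalid). */
static int read_or_extend(int fd, unsigned blkno, char *page)
{
    static const char zeroes[BLCKSZ];
    off_t size = lseek(fd, 0, SEEK_END);

    while ((off_t) (blkno + 1) * BLCKSZ > size) {
        write(fd, zeroes, BLCKSZ);      /* one P_NEW-style extension */
        size += BLCKSZ;
    }
    pread(fd, page, BLCKSZ, (off_t) blkno * BLCKSZ);
    return memcmp(page, zeroes, BLCKSZ) == 0;
}

int main(void)
{
    int fd = open("relfilenode.sim", O_RDWR | O_CREAT, 0600);
    char page[BLCKSZ];

    printf("all-zeroes: %d\n", read_or_extend(fd, 20, page));
    close(fd);
    return 0;
}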
I wrote:
> Greg Stark writes:
>> WAL-E actually didn't restore a whole 1GB file due to a transient S3
>> problem, in fact a bunch of them.
> Hah. Okay, I think we can write this issue off as closed then.
Oh, wait a minute. It's not just a matter of whether we find the right
block: we also have to consider whether XLogReadBufferExtended will
apply the right "mode" behavior.
Greg Stark writes:
> On Wed, Feb 12, 2014 at 6:55 PM, Tom Lane wrote:
>> Greg Stark writes:
>>> This does possibly allocate an extra block past the target block. I'm
>>> not sure how surprising that would be for the rest of the code.
>> Should be fine; we could end up with an extra block after
Greg Stark writes:
> On Wed, Feb 12, 2014 at 5:29 PM, Tom Lane wrote:
>> How about the attached instead?
> This does possibly allocate an extra block past the target block. I'm
> not sure how surprising that would be for the rest of the code.
Should be fine; we could end up with an extra block
I wrote:
> Greg Stark writes:
>> (Or maybe the hot backup
>> process could just catch the files in this state if a table is rapidly
>> growing and it doesn't take care to avoid picking up new files that
>> appear after it starts?)
> That's a possible explanation I guess, but it doesn't seem terri
On Wed, Feb 12, 2014 at 5:29 PM, Tom Lane wrote:
> How about the attached instead?
This does possibly allocate an extra block past the target block. I'm
not sure how surprising that would be for the rest of the code.
For what it's worth I've confirmed the bug in wal-e caused the initial
problem.
Greg Stark writes:
> So I think I've come up with a scenario that could cause this. I don't
> think it's exactly what happened here but maybe something analogous
> happened with our base backup restore.
I agree it seems like a good idea for XLogReadBufferExtended to defend
itself against successi
So here's my attempt to rewrite this logic. I ended up refactoring a
bit because I found it unnecessarily confusing having the mode
branches in several places. I think it's much clearer just having two
separate pieces of logic for RBM_NEW and the extension cases since all
they have in common is the
So I think I've come up with a scenario that could cause this. I don't
think it's exactly what happened here but maybe something analogous
happened with our base backup restore.
On the primary you extend a table a bunch, including adding new
segments, but crash before committing (or checkpointing)
On Sun, Feb 9, 2014 at 2:54 PM, Greg Stark wrote:
> Bad block's page header -- this is in the 56th relation segment:
>
> =# select
> (page_header(E'\\x2005583b05aa050028001805002004201098e00f2090e00f088d24061885e00f')).*;
> lsn | tli | flags | lower | upper | special | pagesize | version | prune_xid
On Thu, Feb 6, 2014 at 11:41 PM, Greg Stark wrote:
>
> That doesn't explain the other instance or the other copies of this
> database. I think the most productive thing I can do is switch my
> attention to the other database to see if it really looks like the
> same problem.
So here's an instance
Andres Freund writes:
> That reminds me, not that I directly see how it could be responsible,
> there's still 20131029011623.gj20...@awork2.anarazel.de ff. around. I
> don't think we came to an agreement in that thread how to fix the
> problem.
Hm, yeah. I'm not sure I believe Heikki's argument t
On Thu, Feb 6, 2014 at 11:48 PM, Andres Freund wrote:
>
> That's not necessarily true. If e.g. the buffer mapping would change
> racily, the result write from the bgwriter could very well end up
> increasing the file size, leaving a hole in between its write and the
> original size.
a) the segment
On 2014-02-06 23:41:19 +0100, Greg Stark wrote:
> The problem with the bgwriter being at fault is that from what I can
> see the bgwriter will never extend a file. That means the xlog
> recovery code must have done it. That means even if the bgwriter came
> along and looked at the buffer we just re
On Thu, Feb 6, 2014 at 10:48 PM, Tom Lane wrote:
> I had noticed that the WAL records that were mis-replayed seemed to
> be bunched pretty close together (two of them even adjacent). Could
> you confirm that? If so, it seems like we're looking for some condition
> that makes mis-replay fairly pr
Greg Stark writes:
> Both the primary and the standby were 9.1.11 from the get-go. The
> database the primary was forked off of was 9.1.10 but as far as I can
> tell the primary in the current pair has no problems.
> What's worse is we created a new standby from the same base backup and
> replaye
On Mon, Feb 3, 2014 at 12:02 AM, Tom Lane wrote:
> What version were you running before 9.1.11 exactly? I took a look
> through all the diffs from 9.1.9 up to 9.1.11, and couldn't find any
> changes that seemed even vaguely related to this. There are some
> changes in known-transaction tracking,
Greg Stark writes:
> On Sun, Feb 2, 2014 at 6:03 PM, Tom Lane wrote:
>> Can we see the associated WAL records (ie, the ones matching the LSNs
>> in the last blocks of these files)?
> Sorry, I've lost track of what information I already shared or didn't,
Hm. So one of these is a heap update, no
On Sun, Feb 2, 2014 at 6:03 PM, Tom Lane wrote:
> Greg Stark writes:
>> The relfilenodes that have nul blocks before the last block are:
>
> Can we see the associated WAL records (ie, the ones matching the LSNs
> in the last blocks of these files)?
Sorry, I've lost track of what information I already shared or didn't,
Greg Stark writes:
> The relfilenodes that have nul blocks before the last block are:
Can we see the associated WAL records (ie, the ones matching the LSNs
in the last blocks of these files)?
regards, tom lane
Hm, I'm not entirely convinced those are erroneous replays to wrong
blocks. They don't look right but there are no blocks of NULs
preceding them. So if they're wrong then they only extended the
relations by a single block.
The relfilenodes that have nul blocks before the last block are:
I've poked at this a bit more. There are at least 10 relations where
the last block doesn't match the block mentioned in the xlog record
that its LSN indicates. At least it looks like from the info xlogdump
prints.
Including two blocks where the "correct" block has the same LSN which
maybe means t
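That comparison can be scripted straight off the raw files. A minimal
standalone sketch (assumes 9.1's page layout, where pd_lsn occupies the
first 8 bytes of the page as two 32-bit halves, and a machine with the
same byte order as the server that wrote the file):

#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    FILE *f;
    uint32_t lsn[2];

    if (argc < 2 || (f = fopen(argv[1], "rb")) == NULL)
        return 1;
    fseek(f, -8192L, SEEK_END);                  /* last page of the file */
    fread(lsn, sizeof(uint32_t), 2, f);
    printf("pd_lsn = %X/%X\n", lsn[0], lsn[1]);  /* compare with xlogdump */
    fclose(f);
    return 0;
}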
The plot thickens...
Looking at the next relation I see a similar symptom of a single valid
block at the end of several segments of nuls. This relation is also a
btree on the same table and has a header in the near vicinity of the
xlog:
d9de7pcqls4ib6=# select
(page_header(get_raw_page('event_dat
On Fri, Jan 31, 2014 at 8:21 PM, Tom Lane wrote:
> So on a filesystem that supports "holes"
> in files, I'd expect that the added segments would be fully allocated
> if XLogReadBufferExtended did the deed, but they'd be quite small if
> _mdfd_getseg did so. The du results you started with sugges
Josh Berkus writes:
> FWIW, we've periodically seen reports from our clients of replica
> databases being slightly larger than the master. Nothing reproducible
> or as severe as Greg's issue, or we'd have reported it. But this could
> be a more widespread issue, just that it affects most users i
On Fri, Jan 31, 2014 at 10:11 PM, Tom Lane wrote:
> Yeah, I'd been wondering if the WAL record somehow got corrupted while
> in memory (presumably after being CRC-checked). It's a bit hard to see
> how though.
One thing I mentioned early on but bears repeating is that this
instance is 9.1.11.
One thing I keep coming back to is a bad RAM chip setting a bit in the
block number. But I just can't seem to get it to add up. The difference is
not a power of two, it had happened on two different machines, and we don't
see other weirdness on the machine. It seems like a strange coincidence it
wo
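The two block numbers that turn up later in the thread bear this out; a
quick check (uses a GCC/Clang builtin for the bit count):

#include <stdio.h>

int main(void)
{
    unsigned a = 3634978;   /* block in the WAL record */
    unsigned b = 7141472;   /* block where the page actually sits */

    printf("a=%#x b=%#x xor=%#x bits=%d\n",
           a, b, a ^ b, __builtin_popcount(a ^ b));
    /* a=0x377722 b=0x6cf860 xor=0x5b8f42 bits=12
     * -- a dozen differing bits, not a single flipped one */
    return 0;
}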
So just to summarize, this xlog record:
[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid
3634978/282
[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
blk:3634978 hole_off/len:1240/2072
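Those numbers are internally consistent, assuming 9.1's header sizes
(32-byte WAL record header, 24-byte backup-block header): the full-page
image with hole_off/len 1240/2072 stores 8192 - 2072 = 6120 bytes of the
page, and 32 + 18 + 24 + 6120 = 6194, exactly the tot_len shown.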
On Fri, Jan 31, 2014 at 3:41 PM, Tom Lane wrote:
>> 400 * 400 * 400 / 2000 * 54 + 1F0C0000 / 2000
>> 11073632
Ooops, it's reading 54 in hex there.
> # select ((2^30) * 54.0 + x'1F0C0000'::bit(32)::int) / 8192;
>  ?column?
> ----------
>   7141472
ibase=16
400 * 400 * 400 / 2000 * 36 + 1F0C0000 / 2000
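Spelled out in decimal (the byte offset is evidently 0x1F0C0000; the
listing drops its trailing zeros): a 1GB segment holds 2^30 / 8192 =
131072 blocks, so the page sits at block 131072 * 54 + 0x1F0C0000 / 8192
= 7141472, while reading "54" as hex (0x54 = 84) gives the bogus
11073632. A minimal C equivalent:

#include <stdio.h>

int main(void)
{
    long blocks_per_seg = (1L << 30) / 8192;              /* 131072 */
    long wrong = blocks_per_seg * 0x54 + 0x1F0C0000L / 8192;
    long right = blocks_per_seg * 54 + 0x1F0C0000L / 8192;

    printf("%ld %ld\n", wrong, right);    /* 11073632 7141472 */
    return 0;
}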
Sorry guys. I transposed two numbers when looking up the relation.
"data_pk" wasn't the right index.
=# select (page_header(get_raw_page('index_data_id', 'main', 3020854))).* ;
      lsn      | tli | flags | lower | upper | special | pagesize | version | prune_xid
---------------+-----+-------+-------+-------+---------+----------+---------+-----------
Greg Stark writes:
> On Fri, Jan 31, 2014 at 3:19 PM, Andres Freund wrote:
>> Isn't the page 3634978?
> The page in the record is.
> But the page on disk is in the 54th segment at offset 1F0C0000
> So unless my arithmetic is wrong:
> bc -l
> ibase=16
> 400 * 400 * 400 / 2000 * 54 + 1F0C0000 / 2000
Andres Freund writes:
> It's interesting that the smgr gets this wrong then (as also evidenced
> by the fact that relation_size does as well). Could you please do a ls
> -l path/to/relfilenode*?
IIRC, smgrnblocks will stop as soon as it finds a segment that is not
1GB in size. Could you check th
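A standalone sketch of that check (hypothetical paths; this mimics how
mdnblocks() sizes a relation by walking the numbered segments and
stopping at the first one shorter than 1GB):

#include <stdio.h>
#include <sys/stat.h>

#define SEGSIZE (1024L * 1024 * 1024)

int main(void)
{
    long total = 0;

    for (int seg = 0; ; seg++) {
        char path[64];
        struct stat st;

        if (seg == 0)
            snprintf(path, sizeof(path), "base/16385/1261982");
        else
            snprintf(path, sizeof(path), "base/16385/1261982.%d", seg);
        if (stat(path, &st) != 0)
            break;                        /* no more segments */
        total += st.st_size / 8192;
        if (st.st_size < SEGSIZE)
            break;   /* a short segment ends the scan, hiding any later ones */
    }
    printf("apparent relation size: %ld blocks\n", total);
    return 0;
}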
On Fri, Jan 31, 2014 at 3:19 PM, Andres Freund wrote:
>> =# select get_raw_page('data_pkey', 'main', 11073632) ;
>> ERROR: block number 11073632 is out of range for relation "data_pkey"
>
> Isn't the page 3634978?
The page in the record is.
But the page on disk is in the 54th segment at offset 1F0C0000
On Fri, Jan 31, 2014 at 3:08 PM, Andres Freund wrote:
> It points to the end of the record (i.e. the beginning of the next). It
> needs to, because otherwise XLogFlush()es on the pd_lsn wouldn't flush
> enough.
Ah, in which case the relevant record is:
[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid
3634978/282
On Fri, Jan 31, 2014 at 2:39 PM, Greg Stark wrote:
> [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
> blk:3634978 hole_off/len:1240/2072
> [cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
>
On 2014-01-31 14:39:47 +0000, Greg Stark wrote:
> 1261982.53 is entirely nuls. I think that's true for most if not all
> of the intervening files, still investigating.
>
> The 54th segment is nul up to offset 1f0c after which it has valid
> looking blocks:
It'd be interesting to dump the page
1261982.53 is entirely nuls. I think that's true for most if not all
of the intervening files, still investigating.
The 54th segment is nul up to offset 1f0c after which it has valid
looking blocks:
# hexdump 1261982.54 | head -100
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
1f0c0000 0e
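The same check without eyeballing hexdump output; a small standalone
program that reports the first non-NUL 8k page of a segment (file name
from the thread):

#include <stdio.h>
#include <string.h>

int main(void)
{
    static char page[8192], zeroes[8192];
    FILE *f = fopen("1261982.54", "rb");
    long blkno = 0;

    if (!f) { perror("open"); return 1; }
    while (fread(page, 1, sizeof(page), f) == sizeof(page)) {
        if (memcmp(page, zeroes, sizeof(zeroes)) != 0) {
            printf("first non-NUL page: block %ld (byte offset %#lx)\n",
                   blkno, blkno * 8192L);
            break;
        }
        blkno++;
    }
    fclose(f);
    return 0;
}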
On Fri, Jan 31, 2014 at 11:26 AM, Andres Freund wrote:
> The slightly more likely explanation for transient errors is that you
> hit the vacuum bug (061b079f89800929a863a692b952207cadf15886). That had
> only taken effect if HS has already assembled a snapshot, which can make
> such an error vanish
On Sun, Jan 26, 2014 at 5:45 PM, Andres Freund wrote:
>
>> We're also seeing log entries about "wal contains reference to invalid
>> pages" but these errors seem only vaguely correlated. Sometimes we get
>> the errors but the tables don't grow noticeably and sometimes we don't
>> get the errors an
Hi,
On 2014-01-24 19:23:28 -0500, Greg Stark wrote:
> Since the point release we've run into a number of databases that when
> we restore from a base backup end up being larger than the primary
> database was. Sometimes by a large factor. The data below is from
> 9.1.11 (both primary and standby)