Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

Andres Freund Sat, 15 Feb 2014 03:46:32 -0800

On 2014-02-14 22:30:45 -0500, Tom Lane wrote:
> Andres Freund <[email protected]> writes:
> > On 2014-02-14 20:46:01 +0000, Greg Stark wrote:
> >> Going over this I think this is still a potential issue:
> >> On 31 Jan 2014 15:56, "Andres Freund" <[email protected]> wrote:
> >>> I am not sure that explains the issue, but I think the redo action for
> >>> truncation is not safe across crashes.  A XLOG_SMGR_TRUNCATE will just
> >>> do a smgrtruncate() (and then mdtruncate) which will iterate over the
> >>> segments starting at 0 till mdnblocks()/segment_size and *truncate* but
> >>> not delete individual segment files that are not needed anymore, right?
> >>> If we crash in the midst of that a new mdtruncate() will be issued, but
> >>> it will get a shorter value back from mdnblocks().
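
To make the failure mode concrete, here is a toy model in Python. It is only a sketch, not the md.c code: the function names mirror md.c but the bodies are simplified stand-ins, segments are plain integers in a list, and RELSEG_SIZE is shrunk to 4 blocks. The one assumption that matters is that the size walk stops at the first segment that is not full.

```python
RELSEG_SIZE = 4  # blocks per segment; tiny on purpose (toy value)


def visible_segments(segs):
    """Segment numbers reached by walking from 0 until a non-full segment."""
    seen = []
    for segno, size in enumerate(segs):
        seen.append(segno)
        if size < RELSEG_SIZE:
            break
    return seen


def mdnblocks(segs):
    """Relation size in blocks, as seen through that walk."""
    return sum(segs[segno] for segno in visible_segments(segs))


def mdtruncate(segs, nblocks, crash_after=None):
    """Shrink segments front to back; files are shortened, never removed.

    crash_after=N simulates a crash once N segments have been shortened.
    Returns False if the simulated crash happened.
    """
    shortened = 0
    for segno in visible_segments(segs):
        keep = min(max(nblocks - segno * RELSEG_SIZE, 0), segs[segno])
        if keep == segs[segno]:
            continue
        if shortened == crash_after:
            return False
        segs[segno] = keep
        shortened += 1
    return True


segs = [4, 4, 4, 4]                  # 16 blocks in four full segments
mdtruncate(segs, 5, crash_after=1)   # crash right after shortening segment 1
after_crash = list(segs)
mdtruncate(segs, 5)                  # redo of the same truncation
after_redo = list(segs)
```

In this model the redo pass changes nothing: segment 1 is already short, so the walk stops there and segments 2 and 3 stay at full size even though the relation now reports only 5 blocks.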


> We could probably fix things so it deleted backwards; it'd be a tad
> tedious because the list structure isn't organized that way, but we
> could do it.

We could just make the list a doubly linked one; that would make it simple.
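
Roughly what I have in mind, again as a toy Python model rather than md.c code. The Seg class, open_chain() and truncate_backwards() are hypothetical names for illustration, and the "disk" is a list of per-segment block counts with RELSEG_SIZE set to 4.

```python
RELSEG_SIZE = 4  # blocks per segment (toy value)


class Seg:
    """One open segment; prev/next make the chain walkable in both directions."""

    def __init__(self, segno, prev=None):
        self.segno = segno
        self.prev = prev
        self.next = None


def open_chain(disk):
    """Open segments from 0 until a non-full one is hit; return the tail."""
    tail = None
    for segno, size in enumerate(disk):
        seg = Seg(segno, prev=tail)
        if tail is not None:
            tail.next = seg
        tail = seg
        if size < RELSEG_SIZE:
            break
    return tail


def truncate_backwards(disk, nblocks, crash_after=None):
    """Shorten segments from the last one towards segment 0.

    crash_after=N simulates a crash once N segments have been shortened.
    """
    shortened = 0
    seg = open_chain(disk)
    while seg is not None:
        keep = min(max(nblocks - seg.segno * RELSEG_SIZE, 0), disk[seg.segno])
        if keep != disk[seg.segno]:
            if shortened == crash_after:
                return False
            disk[seg.segno] = keep
            shortened += 1
        seg = seg.prev
    return True


disk = [4, 4, 4, 4]
truncate_backwards(disk, 5, crash_after=1)   # crash after emptying segment 3
after_crash = list(disk)
truncate_backwards(disk, 5)                  # redo finishes the job
after_redo = list(disk)
```

Walking from the tail means a crash only ever leaves short segments behind full ones, so the forward walk on redo still reaches everything that has to shrink. That holds only if the truncations hit disk in the order they were issued.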

> Not sure if that's good enough though.  If you don't
> want to assume the filesystem metadata is coherent after a crash,
> we might have nonzero-size segments after zero-size ones, even if
> the truncate calls had been issued in the right order.

I don't think that can actually happen on any realistic filesystem we
care about. Metadata updates had better be journaled, so while they
might not persist if the journal wasn't flushed before the crash, the
ones that do persist should be applied in a sane order.
Nonetheless, I am not sure we want to rely on that.

> Another possibility is to keep opening and truncating files until
> we don't find the next segment in sequence, looking directly at the
> filesystem not at the mdfd chain.  I don't think this would be
> appropriate in normal operation, but we could do it if InRecovery
> (and maybe even only if we don't think the database is consistent?)

Yes, I was thinking of simply having an mdnblocks() variant that looks
for the last existing segment file, disregarding its size. But looking
around, it seems mdunlinkfork() has a similar issue, and I don't see how
such a trick could be applied there :(
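
For the truncation case, the variant could look something like the following Python sketch. It is not PostgreSQL code: count_existing_segments() and truncate_all_existing() are made-up names, the relfilenode 16384 and the 4-block segment size are arbitrary, and the only thing taken from the real layout is that segment 0 is the bare file name and later segments carry a .N suffix.

```python
import os
import tempfile

BLCKSZ = 8192  # PostgreSQL's default block size


def segment_path(base, segno):
    """Segment 0 is the bare file name; later segments get a .N suffix."""
    return base if segno == 0 else f"{base}.{segno}"


def count_existing_segments(base):
    """Probe the directory for consecutive segment files, ignoring sizes."""
    segno = 0
    while os.path.exists(segment_path(base, segno)):
        segno += 1
    return segno


def truncate_all_existing(base, nblocks, relseg_size):
    """Shorten every segment file found on disk so nblocks blocks remain."""
    for segno in range(count_existing_segments(base)):
        path = segment_path(base, segno)
        keep = min(max(nblocks - segno * relseg_size, 0), relseg_size) * BLCKSZ
        if os.path.getsize(path) > keep:
            os.truncate(path, keep)


with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "16384")              # hypothetical relfilenode
    for segno, blocks in enumerate([4, 1, 4, 4]):  # state left by a crashed truncate
        with open(segment_path(base, segno), "wb") as f:
            f.write(b"\0" * (blocks * BLCKSZ))
    truncate_all_existing(base, nblocks=5, relseg_size=4)
    sizes = [os.path.getsize(segment_path(base, s)) // BLCKSZ for s in range(4)]
```

Because the probe stops at the first missing file rather than the first short one, the stale segments behind the short one get emptied too.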

I guess the theoretically correct thing would be to make all WAL records
about truncation and unlinking contain the current size of the relation,
so that redo does not have to infer it from whatever is on disk. But
especially with deletions and forks, that will probably turn out to be
annoying to do.
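
As a sketch of what that would buy us for truncation (a Python toy model; the TruncateRecord layout and redo_truncate() are hypothetical and not an existing WAL record format, and RELSEG_SIZE is again 4 blocks):

```python
from dataclasses import dataclass

RELSEG_SIZE = 4  # blocks per segment (toy value)


@dataclass(frozen=True)
class TruncateRecord:
    """Hypothetical truncate record that also carries the pre-truncation size."""
    old_nblocks: int
    new_nblocks: int


def redo_truncate(disk, rec):
    """Visit every segment the relation had at logging time, whatever is on disk now."""
    nsegs = -(-rec.old_nblocks // RELSEG_SIZE)  # ceiling division
    for segno in range(min(nsegs, len(disk))):
        keep = max(rec.new_nblocks - segno * RELSEG_SIZE, 0)
        disk[segno] = min(disk[segno], keep)


disk = [4, 1, 4, 4]  # state left behind by a crashed front-to-back truncate
redo_truncate(disk, TruncateRecord(old_nblocks=16, new_nblocks=5))
```

Here redo derives the number of segments to visit from the record alone, so a short segment in the middle no longer hides the ones after it.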

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


