Re: Corruption during WAL replay

Andres Freund Thu, 24 Mar 2022 22:35:03 -0700

Hi,

On 2022-03-25 01:23:00 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > I do see that the LSN that ends up on the page is the same across a few runs
> > of the test on serinus. Which presumably differs between different
> > animals. Surprised that it's this predictable - but I guess the run is short
> > enough that there's no variation due to autovacuum, checkpoints etc.
> 
> Uh-huh.  I'm not surprised that it's repeatable on a given animal.
> What remains to be explained:
> 
> 1. Why'd it start failing now?  I'm guessing that ce95c5437 *was* the
> culprit after all, by slightly changing the amount of catalog data
> written during initdb, and thus moving the initial LSN.


Yep, verified that (see mail I just sent).


> 2. Why just these two animals?  If initial LSN is the critical thing,
> then the results of "locale -a" would affect it, so platform
> dependence is hardly surprising ... but I'd have thought that all
> the animals on that host would use the same initial set of
> collations.

I think it's the animal's name that makes the difference, due to the
tablespace path length thing. And while I was confused for a second by

petalura
pogona
serinus
dragonet

failing, despite different name lengths, it still makes sense: We MAXALIGN the
start of records. Which explains why flaviventris didn't fail the same way.
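
To illustrate the alignment argument, here is a minimal sketch in Python (not
PostgreSQL code). It assumes an alignment of 8 bytes, which is typical for
MAXIMUM_ALIGNOF on 64-bit platforms, and uses the bare name length as a
stand-in for the variable part of the record. The helper name `maxalign` is
hypothetical. A real record also carries a header and the rest of the path, so
the actual bucket boundaries sit elsewhere; the sketch only shows that lengths
differing by a few bytes can round up to the same aligned offset, while a
longer name lands in a different one.

```python
MAXIMUM_ALIGNOF = 8  # assumption: typical value on 64-bit platforms


def maxalign(length: int, alignof: int = MAXIMUM_ALIGNOF) -> int:
    """Round length up to the next multiple of alignof (a power of two)."""
    return (length + alignof - 1) & ~(alignof - 1)


# Animal names from the mail; len(name) stands in for the part of the
# record that varies with the tablespace path (illustrative only).
names = ["petalura", "pogona", "serinus", "dragonet", "flaviventris"]
aligned = {name: maxalign(len(name)) for name in names}

for name, size in aligned.items():
    print(f"{name:13s} len={len(name):2d} aligned={size}")
```

Under these assumptions the four failing animals (lengths 6, 7 and 8) all
round up to 8, while flaviventris (length 12) rounds up to 16.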


> As for a fix, would damaging more of the page help?  I guess
> it'd just move around the one-in-64K chance of failure.

As I wrote in the other email, I think spreading the changes out wider might
help. But it's still not great. However:

> Maybe we have to intentionally corrupt (e.g. invert) the
> checksum field specifically.

seems like it'd do the trick? Even a single bit change of the checksum ought
to do, as long as we don't set it to 0.
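
A minimal sketch of that idea, in Python rather than the test's own language:
flip one bit of the 16-bit checksum, and pick a different bit if the result
would be 0. The function name `corrupt_checksum` and the choice of bits are
assumptions for illustration; how the test would read and write the checksum
field on the page is not shown here.

```python
def corrupt_checksum(checksum: int) -> int:
    """Return a 16-bit value that differs from checksum and is never 0."""
    checksum &= 0xFFFF
    corrupted = checksum ^ 0x0001      # flip the lowest bit
    if corrupted == 0:                 # only happens when checksum == 1
        corrupted = checksum ^ 0x0002  # flip the next bit instead
    return corrupted


# Every possible stored value ends up changed and non-zero.
results = [corrupt_checksum(c) for c in range(0x10000)]
```

Unlike damaging other bytes of the page, this cannot accidentally leave the
stored checksum matching the page contents, since the stored value always
changes while the rest of the page stays intact.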

Greetings,

Andres Freund
