Hi,

On 2022-03-25 01:23:00 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > I do see that the LSN that ends up on the page is the same across a few runs
> > of the test on serinus. Which presumably differs between different
> > animals. Surprised that it's this predictable - but I guess the run is short
> > enough that there's no variation due to autovacuum, checkpoints etc.
>
> Uh-huh. I'm not surprised that it's repeatable on a given animal.
> What remains to be explained:
>
> 1. Why'd it start failing now? I'm guessing that ce95c5437 *was* the
> culprit after all, by slightly changing the amount of catalog data
> written during initdb, and thus moving the initial LSN.

Yep, verified that (see mail I just sent).

> 2. Why just these two animals? If initial LSN is the critical thing,
> then the results of "locale -a" would affect it, so platform
> dependence is hardly surprising ... but I'd have thought that all
> the animals on that host would use the same initial set of
> collations.

I think it's the animal's name that makes the difference, due to the
tablespace path length thing. And while I was confused for a second by
petalura, pogona, serinus and dragonet all failing despite their different
name lengths, it still makes sense: we MAXALIGN the start of records, so
small differences in length get rounded away. Which explains why
flaviventris didn't fail the same way.

> As for a fix, would damaging more of the page help? I guess
> it'd just move around the one-in-64K chance of failure.

As I wrote in the other email, I think spreading the changes out wider might
help. But it's still not great. However:

> Maybe we have to intentionally corrupt (e.g. invert) the
> checksum field specifically.

seems like it'd do the trick? Even a single bit change of the checksum ought
to do, as long as we don't set it to 0.

Greetings,

Andres Freund
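
PS: To make the two points above concrete, here is a rough sketch in Python. It is not the test's actual code, and the helper names (maxalign, corrupt_checksum) are made up for illustration. It assumes MAXIMUM_ALIGNOF is 8, as on typical 64-bit platforms, and that pd_checksum is the 2-byte field directly after the 8-byte pd_lsn at the start of the page header. The real WAL record contains more than the bare animal name, so the lengths below only illustrate how the rounding hides small differences.

```python
import struct

MAXIMUM_ALIGNOF = 8      # assumption: typical 64-bit platform
PD_CHECKSUM_OFFSET = 8   # assumption: pd_checksum follows the 8-byte pd_lsn


def maxalign(length: int) -> int:
    """Round length up to the next multiple of MAXIMUM_ALIGNOF."""
    return (length + MAXIMUM_ALIGNOF - 1) & ~(MAXIMUM_ALIGNOF - 1)


def corrupt_checksum(page: bytes) -> bytes:
    """Return a copy of page with a single bit of the checksum field flipped.

    The new value always differs from the old one and is never 0.
    """
    # "=H": native byte order, unsigned 16-bit, no padding.
    (old,) = struct.unpack_from("=H", page, PD_CHECKSUM_OFFSET)
    new = old ^ 0x0001
    if new == 0:
        # old was 0x0001; flip a different bit so we do not end up at 0.
        new = old ^ 0x0002
    buf = bytearray(page)
    struct.pack_into("=H", buf, PD_CHECKSUM_OFFSET, new)
    return bytes(buf)


if __name__ == "__main__":
    # Names of 6, 7 and 8 bytes round to the same aligned size; 12 does not.
    for name in ("pogona", "serinus", "petalura", "dragonet", "flaviventris"):
        print(name, len(name), maxalign(len(name)))
```

Touching only the checksum field means the stored value is guaranteed to change, so detection no longer depends on what the page's LSN happens to be.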