On Sun, May 7, 2023 at 12:29 AM Evgeny Morozov
<postgres...@realityexists.net> wrote:
> On 6/05/2023 12:34 pm, Thomas Munro wrote:
> > So it does indeed look like something unknown has replaced 32KB of
> > data with 32KB of zeroes underneath us.  Are there more non-empty
> > files that are all-zeroes?  Something like this might find them:
> >
> > for F in base/1414389/*
> > do
> >   if [ -s $F ] && ! xxd -p $F | grep -qEv '^(00)*$' > /dev/null
> >   then
> >     echo $F
> >   fi
> > done
>
> Yes, a total of 309 files are all-zeroes (and 52 files are not).
>
> I also checked the other DB that reports the same "unexpected zero page
> at block 0" error, "test_behavior_638186280406544656" (OID 1414967) -
> similar story there. I uploaded the lists of zeroed and non-zeroed files
> and the ls -la output for both as
> https://objective.realityexists.net/temp/pgstuff3.zip
>
> I then searched recursively such all-zeroes files in $PGDATA/base and
> did not find any outside of those two directories (base/1414389 and
> base/1414967). None in $PGDATA/global, either.

So "diff -u zeroed-files-1414967.txt zeroed-files-1414389.txt" shows
that they have the same broken stuff in the range cloned from the
template database by CREATE DATABASE STRATEGY=WAL_LOG, and it looks
like it's *all* the cloned catalogs, and then they have some
non-matching relfilenodes > 1400000, presumably stuff you created
directly in the new database (I'm not sure if I can say for sure that
those files are broken, without knowing what they are).

Did you previously run this same workload on versions < 15 and never
see any problem?  15 gained a new feature CREATE DATABASE ...
STRATEGY=WAL_LOG, which is also the default.  I wonder if there is a
bug somewhere near that, though I have no specific idea.  If you
explicitly add STRATEGY=FILE_COPY to your CREATE DATABASE commands,
you'll get the traditional behaviour.  It seems like you have some
kind of high frequency testing workload that creates and tests
databases all day long, and just occasionally detects this corruption.
Would you like to try requesting FILE_COPY for a while and see if it
eventually happens like that too?

My spidey sense is leaning away from filesystem bugs.  We've found
plenty of filesystem bugs on these mailing lists over the years and of
course it's not impossible, but I dunno... it seems quite suspicious
that all the system catalogs have apparently been wiped during or
moments after the creation of a new database that's running new
PostgreSQL 15 code...
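
For the FILE_COPY experiment, here is a minimal sketch of what the test
harness might run, assuming it creates its databases through psql.  The
helper name create_db_sql, the example database name and the template1
default are made up for illustration; only the STRATEGY clause itself
is the point.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): print a CREATE DATABASE
# statement that requests the traditional file-copy strategy instead of
# the PostgreSQL 15 default (wal_log).
# $1 = new database name, $2 = template database (assumed default: template1).
# Names are wrapped in double quotes as-is; embedded quotes are not handled.
create_db_sql() {
    db=$1
    template=${2:-template1}
    printf 'CREATE DATABASE "%s" TEMPLATE "%s" STRATEGY = file_copy;\n' \
        "$db" "$template"
}

# Possible invocation (needs a running server, so it is left commented out):
#   psql -X -d postgres -c "$(create_db_sql test_behavior_example)"
```

Swapping only the strategy and leaving the rest of the workload alone
should make it easier to tell whether the zeroed files still show up
with the old code path.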