On Wed, Jul 25, 2018 at 4:07 PM, Andres Freund <and...@anarazel.de> wrote:
>> HEAD/REL_11_STABLE apparently solely being affected points elsewhere,
>> but I don't immediately know where.
>
> Hm, there was:
> http://archives.postgresql.org/message-id/20180628150209.n2qch5jtn3vt2xaa%40alap3.anarazel.de
>
>
> I don't immediately see it being responsible, but I wonder if there's a
> chance it actually is: Note that it happens in a parallel group that
> includes vacuum.sql, which does a VACUUM FULL pg_class - but I still
> don't immediately see how it could apply.

It's now pretty clear that it was not that particular bug, since I
pushed a fix, and yet the issue hasn't gone away on affected buildfarm
animals. There was a recurrence of the problem on lapwing, for example
[1].

Anyway, "VACUUM FULL pg_class" should be expected to corrupt
pg_class_oid_index when we happen to get a parallel build, since
pg_class is a mapped relation, and I've identified that as a problem
for parallel CREATE INDEX [2]. If that was the ultimate cause of the
issue, it would explain why only REL_11_STABLE and master are
involved.

My guess is that the metapage considers the root page to be at block 3
(block 3 is often the root page for small though not tiny B-Trees),
which for whatever reason is where we get a short read. I don't know
why there is a short read, but corrupting mapped catalog indexes at
random can be expected to cause all kinds of chaos, so that doesn't
mean much.

In any case, I'll probably push a fix for this other bug on Friday,
barring any objections. It's possible that that will make the problem
go away.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2018-08-04%2004%3A20%3A01
[2] https://www.postgresql.org/message-id/CAH2-Wzn=j0i8rxCAo6E=tbo9xuyxb8hbusnw7j_stkon8dd...@mail.gmail.com
--
Peter Geoghegan