On 2019-01-28 22:37:53 -0500, Tom Lane wrote: > Andres Freund <and...@anarazel.de> writes: > > I did that now. I couldn't reproduce it locally, despite a lot of > > runs. Looking at the buildfarm it looks like the failures were, > > excluding handfish which failed without recognizable symptoms before and > > after, on BSD derived platforms (netbsd, freebsd, OX), which certainly > > is interesting. > > Isn't it now. Something about the BSD scheduler perhaps? But we've > got four or five different BSD-ish platforms that reported failures, > and it's hard to believe they've all got identical schedulers. > > That second handfish failure does match the symptoms elsewhere: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=handfish&dt=2019-01-29%2000%3A20%3A22 > > --- > /home/filiperosset/dev/client-code-REL_8/HEAD/pgsql.build/src/interfaces/ecpg/test/expected/thread-thread.stderr > 2018-10-30 20:11:45.551967381 -0300 > +++ > /home/filiperosset/dev/client-code-REL_8/HEAD/pgsql.build/src/interfaces/ecpg/test/results/thread-thread.stderr > 2019-01-28 22:38:20.614211568 -0200 > @@ -0,0 +1,20 @@ > +SQL error: page 0 of relation "test_thread" should be empty but is not on > line 125 > > so it's not quite 100% BSD, but certainly the failure rate on BSD is > way higher than elsewhere. Puzzling.
Interesting. While chatting with Robert about this issue I came across the following section of code: /* * If the FSM knows nothing of the rel, try the last page before we * give up and extend. This avoids one-tuple-per-page syndrome during * bootstrapping or in a recently-started system. */ if (targetBlock == InvalidBlockNumber) { BlockNumber nblocks = RelationGetNumberOfBlocks(relation); if (nblocks > 0) targetBlock = nblocks - 1; } I think that explains the issue (albeit not why it is much more frequent on BSDs). Because we're not going through the FSM, it's perfectly possible to find a page that is uninitialized, *and* is not yet in the FSM. The only reason this wasn't previously actively broken, I think, is that while we previously *also* looked that page (before the extending backend acquired a lock!), when looking at the page PageGetHeapFreeSpace(), via PageGetFreeSpace(), decides there's no free space because it just interprets the zeroes in pd_upper - pd_lower as no free space. Hm, thinking about what a good solution here could be. Greetings, Andres Freund