Andres Freund <and...@anarazel.de> writes:
> On 2019-09-20 16:25:21 -0400, Tom Lane wrote:
>> I recreated my freebsd-9-under-qemu setup and I can still reproduce
>> the problem, though not with high reliability (order of 1 time in 10).
>> Anything particular you want logged?
> A DEBUG2 log would help a fair bit, because it'd log some information
> about what changes the "horizons" determining when data may be removed.

Actually, what I did was as attached [1], and I am getting traces like
[2].  The problem seems to occur only when there are two or three
processes concurrently creating the same snapshot file.  It's not
obvious from the debug trace, but the snapshot file *does* exist
after the music stops.

It is very hard to look at this trace and conclude anything other
than "rename(2) is broken, it's not atomic".  Nothing in our code
has deleted the file: no checkpoint has started, nor do we see
the DEBUG1 output that CheckPointSnapBuild ought to produce.
But fsync_fname momentarily can't see it (and then later another
process does see it).

It is now apparent why we're only seeing this on specific ancient
platforms.  I looked around for info about rename(2) not being
atomic, and I found this info about FreeBSD:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=94849

The reported symptom there isn't quite the same, so probably there
is another issue, but there is plenty of reason to be suspicious
that UFS rename(2) is buggy in this release.

As for dromedary's ancient version of macOS, Apple is exceedingly
untransparent about their bugs, but I found

http://www.weirdnet.nl/apple/rename.html

In short, what we've got here is OS bugs that have probably been
resolved years ago.

The question is what to do next.  Should we just retire these
specific buildfarm critters, or do we want to push ahead with
getting rid of the PANIC here?

			regards, tom lane