On 03/04/2018 10:27 AM, Thomas Munro wrote:
> On Sun, Mar 4, 2018 at 5:40 PM, Thomas Munro
> <thomas.mu...@enterprisedb.com> wrote:
>> Could shm_mq_detach_internal() need a pg_write_barrier() before it
>> writes mq_detached = true, to make sure that anyone who observes that
>> can also see the most recent increase of mq_bytes_written?
>
> I can reproduce both failure modes (missing tuples and "lost contact")
> in the regression database with the attached Python script on my Mac.
> It takes a few minutes and seems to happen sooner when my machine
> is also doing other stuff (playing debugging music...).
>
> I can reproduce it at 34db06ef9a1d7f36391c64293bf1e0ce44a33915
> "shm_mq: Reduce spinlock usage." but (at least so far) not at the
> preceding commit.
>
> I can fix it with the following patch, which writes XXX out to the log
> where it would otherwise miss a final message sent just before
> detaching with sufficiently bad timing/memory ordering. This patch
> isn't my proposed fix, it's just a demonstration of what's busted.
> There could be a better way to structure things than this.
I can confirm that this resolves the issue for me. Before the patch, I saw 112 failures in ~11500 runs; with the patch I saw 0 failures, but about 100 "XXX" messages in the log. So my conclusion is that your analysis is likely correct.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services