On Tue, Feb 12, 2019 at 10:15 PM Sergei Kornilov <s...@zsrv.org> wrote:
> I still have error with parallel_leader_participation = off.
Justin very kindly set up a virtual machine similar to the one where he'd seen the problem so I could experiment with it. Eventually I also managed to reproduce it locally, and have finally understood the problem. It doesn't happen on master (hence some of my initial struggle to reproduce it) because of commit 197e4af9, which added srandom() calls to set a different seed for each parallel worker.

Perhaps you see where this is going already... The problem is that a DSM handle (ie a random number) can be reused for a new segment immediately after the shared memory object has been destroyed, but before the DSM slot has been released. Now two DSM slots have the same handle, and dsm_attach() can be confused by the old segment and give up.

Here's a draft patch to fix that. It also clears the handle in a case where it wasn't previously cleared. That wasn't strictly necessary; it just made debugging less confusing.

--
Thomas Munro
http://www.enterprisedb.com
From 5d94eac7d11e74c280aa32a98d14eaeca40679de Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Fri, 15 Feb 2019 00:50:52 +1300
Subject: [PATCH] Fix race in dsm_attach() when handles are recycled.

DSM handles can be recycled as soon as the underlying shared memory
object has been destroyed.  That means that for a brief moment we
might have two DSM slots with the same handle.  While trying to
attach, if we encounter a slot with refcnt == 1, meaning that it is
being destroyed, we should continue our search in case the same
handle exists in another slot.

The failure was more likely in the back branches, due to the lack of
a distinct seed for random() in each parallel worker, making handle
reuse more likely.

In passing, clear the handle when dsm_unpin_segment() frees a
segment.  That wasn't known to be an active bug, but it was untidy.

Back-patch to 10.

Author: Thomas Munro
Reported-by: Justin Pryzby
Discussion: https://postgr.es/m/20190207014719.GJ29720@telsasoft.com
---
 src/backend/storage/ipc/dsm.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index cab7ae74ca..b01aea9de9 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -580,10 +580,11 @@ dsm_attach(dsm_handle h)
 		/*
 		 * If the reference count is 1, the slot is still in use, but the
 		 * segment is in the process of going away.  Treat that as if we
-		 * didn't find a match.
+		 * didn't find a match.  Keep looking though; it's possible for the
+		 * handle to have been reused already.
 		 */
 		if (dsm_control->item[i].refcnt == 1)
-			break;
+			continue;
 
 		/* Otherwise we've found a match. */
 		dsm_control->item[i].refcnt++;
@@ -909,6 +910,7 @@ dsm_unpin_segment(dsm_handle handle)
 		Assert(dsm_control->item[control_slot].handle == handle);
 		Assert(dsm_control->item[control_slot].refcnt == 1);
 		dsm_control->item[control_slot].refcnt = 0;
+		dsm_control->item[control_slot].handle = DSM_HANDLE_INVALID;
 		LWLockRelease(DynamicSharedMemoryControlLock);
 	}
 }
-- 
2.20.1