Re: "ERROR: latch already owned" on gharial

Heikki Linnakangas Thu, 08 Feb 2024 04:58:05 -0800

On 08/02/2024 04:08, Soumyadeep Chakraborty wrote:

A possible ordering of events:


(1) DisownLatch() is called by pid Y during ProcKill() and the write for
latch->owner_pid = 0 is NOT yet flushed to shmem.

(2) The PGPROC object for pid Y is returned to the free list.

(3) Pid X sees the same PGPROC object on the free list and grabs it.

(4) Pid X does sanity check inside OwnLatch during InitProcess and
still sees the
old value of latch->owner_pid = Y (and not = 0), and trips the ERROR.

The above sequence of operations should apply to PG HEAD as well.

Suggestion:

Should we do a pg_memory_barrier() at the end of DisownLatch(), like in
ResetLatch(), like the one introduced in [3]? This would ensure that the write
latch->owner_pid = 0; is flushed to shmem. The attached patch does this.

Hmm, there is a pair of SpinLockAcquire() and SpinLockRelease() inProcKill(), before step 3 can happen. Comment in spin.h aboutSpinLockAcquire/Release:

 *      Load and store operations in calling code are guaranteed not to be
 *      reordered with respect to these operations, because they include a
 *      compiler barrier.  (Before PostgreSQL 9.5, callers needed to use a
 *      volatile qualifier to access data protected by spinlocks.)

That talks about a compiler barrier, though, not a memory barrier. Butlooking at the implementations in s_lock.h, I believe they do act asmemory barrier, too.

So you might indeed have that problem on 9.4, but AFAICS not on laterversions.


--
Heikki Linnakangas
Neon (https://neon.tech)

Re: "ERROR: latch already owned" on gharial

Reply via email to