Hey hackers,

I wanted to report that we have seen this issue (with the procLatch) a few
times very sporadically on Greenplum 6X (based on 9.4), with relatively newer
versions of GCC.

I realize that 9.4 is out of support, so this email is purely to add on to the
existing thread, in case the info can help fix/reveal something in supported
versions.

Unfortunately, we don't have a core to share as we don't have the benefit of
commit [1] in Greenplum 6X, but we do possess commit [2] which gives us an elog
ERROR as opposed to PANIC.

Instance 1:

Event 1: 2023-11-13 10:01:31.927168 CET..., pY,
..."LOG","00000","disconnection: session time: ..."
Event 2: 2023-11-13 10:01:32.049135
CET...,pX,,,,,"FATAL","XX000","latch already owned by pid Y (is_set:
0) (pg_latch.c:159)",,,,,,,0,,
"pg_latch.c",159,"Stack trace:
1    0xbde8b8 postgres errstart (elog.c:567)
2    0xbe0768 postgres elog_finish (discriminator 7)
3    0xa08924 postgres <symbol not found> (pg_latch.c:158) <---------- OwnLatch
4    0xa7f179 postgres InitProcess (proc.c:523)
5    0xa94ac3 postgres PostgresMain (postgres.c:4874)
6    0xa1e2ed postgres <symbol not found> (postmaster.c:2860)
7    0xa1f295 postgres PostmasterMain (discriminator 5)
...
"LOG","00000","server process (PID Y) exited with exit code
1",,,,,,,0,,"postmaster.c",3987,

Instance 2 (was reported with (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)):

Exactly the same as Instance 1 with identical log, ordering of events and stack
trace, except this time (is_set: 1) when the ERROR is logged.

A possible ordering of events:

(1) DisownLatch() is called by pid Y during ProcKill() and the write for
latch->owner_pid = 0 is NOT yet flushed to shmem.

(2) The PGPROC object for pid Y is returned to the free list.

(3) Pid X sees the same PGPROC object on the free list and grabs it.

(4) Pid X does sanity check inside OwnLatch during InitProcess and
still sees the
old value of latch->owner_pid = Y (and not = 0), and trips the ERROR.

The above sequence of operations should apply to PG HEAD as well.

Suggestion:

Should we do a pg_memory_barrier() at the end of DisownLatch(), like in
ResetLatch(), like the one introduced in [3]? This would ensure that the write
latch->owner_pid = 0; is flushed to shmem. The attached patch does this.

I'm not sure why we didn't introduce a memory barrier in DisownLatch() in [3].
I didn't find anything in the associated hackers thread [4] either. Was it the
performance impact, or was it just because SetLatch and ResetLatch
were more racy
and this is way less likely to happen?

This is out of my wheelhouse, but would one additional barrier in a process'
lifecycle be that bad for performance?

Appendix:

Build details: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)

CFLAGS=-Wall -Wmissing-prototypes -Wpointer-arith -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv
-fexcess-precision=standard -fno-aggressive-loop-optimizations
-Wno-unused-but-set-variable -Wno-address -Werror=implicit-fallthrough=3
-Wno-format-truncation -Wno-stringop-truncation -m64 -O3
-fargument-noalias-global -fno-omit-frame-pointer -g -std=gnu99
-Werror=uninitialized -Werror=implicit-function-declaration

Regards,
Soumyadeep (VMware)

[1] 
https://github.com/postgres/postgres/commit/12e28aac8e8eb76cab13a4e9b696e3dab17f1c99
[2] 
https://github.com/greenplum-db/gpdb/commit/81fdd6c5219af865e9dc41f4087e0405d6616050
[3] 
https://github.com/postgres/postgres/commit/14e8803f101a54d99600683543b0f893a2e3f529
[4] 
https://www.postgresql.org/message-id/flat/20150112154026.GB2092%40awork2.anarazel.de
From 9702054c37aa80cb60a16b2d80a475e9b27b087a Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty <soumyadeep2...@gmail.com>
Date: Wed, 7 Feb 2024 18:04:43 -0800
Subject: [PATCH v1 1/1] Use memory barrier in DisownLatch

Failing to do so may cause another process undergoing startup to see our
pid associated with the latch during OwnLatch's sanity check.
---
 src/backend/storage/ipc/latch.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index d8a69990b3a..332c5e0b2cf 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -493,6 +493,12 @@ DisownLatch(Latch *latch)
 	Assert(latch->owner_pid == MyProcPid);
 
 	latch->owner_pid = 0;
+
+	/*
+	 * Ensure that another backend reusing this PGPROC during startup sees that
+	 * owner_pid = 0 and doesn't blow up in OwnLatch().
+	 */
+	pg_memory_barrier();
 }
 
 /*
-- 
2.34.1

Reply via email to