On Wed, Apr 29, 2026 at 11:00 AM Alexander Lakhin <[email protected]> wrote:
>
> Dear Sawada-san,
>
> 28.04.2026 22:27, Masahiko Sawada wrote:
> > On Mon, Apr 27, 2026 at 11:00 AM Alexander Lakhin <[email protected]> wrote:
> >> I've been puzzled by a buildfarm failure [1] with such symptoms for a while
> >> and even reproduced it locally once, but couldn't gather more information
> >> that time. But now that you have described the scenario, I can easily
> >> reproduce the same test failure with:
> >> --- a/src/backend/storage/ipc/procsignal.c
> >> +++ b/src/backend/storage/ipc/procsignal.c
> >> @@ -206,6 +206,7 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
> >>          if (cancel_key_len > 0)
> >>                  memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
> >>          slot->pss_cancel_key_len = cancel_key_len;
> >> +pg_usleep(10000);
> >>          pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
> > Thank you for testing this.
> >
> > I've attached a patch to address the issue. I haven't verified it
> > across all versions yet, but I suspect it exists in the stable
> > branches as well...
>
> Thank you for the fix! It works for me too.
>
> I was wondering why is that failure the only one of this kind on buildfarm
> (in last two years, at least), so I've tried to reproduce it on
> REL_18_STABLE... and failed.
>
> Then I've bisected it on the master branch and found (your) commit that
> introduced this behavior: 67c20979c from 2025-12-23.
>

I've confirmed that this race condition is present from v15 through
master. v14 has the procsignal barrier code but doesn't use it
anywhere. In v18 and older, the race can be hit while executing DROP
DATABASE, DROP TABLESPACE, etc., whereas on master it can be hit in
more cases because procsignal barriers are now used in more places. In
any case, if a process emits a signal barrier while another process is
between the initialization of slot->pss_barrierGeneration and the
initialization of slot->pss_pid, the subsequent
WaitForProcSignalBarrier() ends up waiting for that process forever.
So I think the patch should be backpatched to v15. Please review these
patches.
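
To make the window concrete, here is a minimal single-threaded sketch
of the interleaving. It is not the actual procsignal.c code: DemoSlot
and all demo_* names are made up for illustration, C11 atomics stand in
for the pg_atomic_* and pg_memory_barrier() calls, and the emitter is
assumed to skip any slot whose PID is still zero, as described in the
commit message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the shared-memory state. */
typedef struct
{
	atomic_uint pss_pid;		/* 0 = PID not yet published */
	atomic_uint_fast64_t pss_barrierGeneration; /* last absorbed generation */
} DemoSlot;

static atomic_uint_fast64_t demo_global_generation;

/* Emitter: bump the global generation, signal only if a PID is visible. */
static bool
demo_emit_barrier(DemoSlot *slot)
{
	atomic_fetch_add(&demo_global_generation, 1);
	return atomic_load(&slot->pss_pid) != 0;
}

/* Init step: publish the PID, followed by a full fence. */
static void
demo_publish_pid(DemoSlot *slot)
{
	atomic_store(&slot->pss_pid, 4242);	/* arbitrary nonzero PID */
	atomic_thread_fence(memory_order_seq_cst);
}

/* Init step (also used to absorb a barrier): copy the global generation. */
static void
demo_copy_generation(DemoSlot *slot)
{
	atomic_store(&slot->pss_barrierGeneration,
				 atomic_load(&demo_global_generation));
}

/*
 * Run the two init steps in the chosen order, with the emitter running
 * just before init step emit_at_step (2 means after init has finished).
 * Returns true if the slot is left behind the global generation without
 * having been signalled, i.e. a waiter would never see it catch up.
 */
static bool
demo_waiter_hangs(bool publish_pid_first, int emit_at_step)
{
	DemoSlot	slot;
	bool		signalled = false;

	assert(emit_at_step >= 0 && emit_at_step <= 2);
	atomic_init(&slot.pss_pid, 0);
	atomic_init(&slot.pss_barrierGeneration, 0);
	atomic_store(&demo_global_generation, 0);

	for (int step = 0; step <= 2; step++)
	{
		if (step == emit_at_step)
			signalled = demo_emit_barrier(&slot);

		if (step == 0)
		{
			if (publish_pid_first)
				demo_publish_pid(&slot);
			else
				demo_copy_generation(&slot);
		}
		else if (step == 1)
		{
			if (publish_pid_first)
				demo_copy_generation(&slot);
			else
				demo_publish_pid(&slot);
		}
	}

	/* A signalled process eventually absorbs the barrier. */
	if (signalled)
		demo_copy_generation(&slot);

	return atomic_load(&slot.pss_barrierGeneration) <
		atomic_load(&demo_global_generation);
}
```

Under these assumptions, demo_waiter_hangs(false, 1) reports a hang
(old ordering, emitter running between the two steps), while with
publish_pid_first = true the slot ends up caught up for every emitter
position.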

FYI, I found that we had a similar report[1] last year, though I'm not
sure it hit exactly the same issue.

Regards,

[1] https://www.postgresql.org/message-id/cagqgydtavkg3dbtebtyxzlm48jmzr2bcvteybswlv5hvwsb...@mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
From 63ce4e5578f1703254952cd3aee3a0a22c6da990 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Tue, 28 Apr 2026 12:21:21 -0700
Subject: [PATCH v2_15] Fix race between ProcSignalInit() and
 EmitProcSignalBarrier().

Previously, ProcSignalInit() read the global barrier generation before
publishing its PID into the pss_pid slot. This created a race
condition: a process could initialize its local generation with an
older global value, while a concurrent EmitProcSignalBarrier() might
skip that process because its pss_pid was still zero. This resulted in
WaitForProcSignalBarrier() hanging indefinitely.

This commit fixes the issue by publishing pss_pid before reading
psh_barrierGeneration, with a memory barrier in between so that the
store is globally visible first. A concurrent EmitProcSignalBarrier()
then either observes the published PID and signals this slot, or
completes its generation increment before we load it.

While this race has become more visible due to recent features using
signal barriers in more places (such as online wal_level changes), the
issue has theoretically been present since signal barriers were introduced
to release smgr caches (e.g., in DROP DATABASE). So backpatch to 15.

This issue was also reported by buildfarm animal flaviventris.

Reported-by: Melanie Plageman <[email protected]>
Reviewed-by: Alexander Lakhin <[email protected]>
Reviewed-by: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/caeze2wgajmwredn7chtba8er2ybvkcoa0kvn25-1evntrhs...@mail.gmail.com
Backpatch-through: 15
---
 src/backend/storage/ipc/procsignal.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 21a9fc0fdd2..cd4fe11b1a6 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -175,6 +175,16 @@ ProcSignalInit(int pss_idx)
 	/* Clear out any leftover signal reasons */
 	MemSet(slot->pss_signalFlags, 0, NUM_PROCSIGNALS * sizeof(sig_atomic_t));
 
+	/*
+	 * Publish the PID before reading the global barrier generation to ensure
+	 * that EmitProcSignalBarrier() doesn't skip us while we are grabbing an
+	 * older generation. We need a memory barrier here to make sure that the
+	 * update of pss_pid is globally visible before the load of the global
+	 * barrier generation executes.
+	 */
+	slot->pss_pid = MyProcPid;
+	pg_memory_barrier();
+
 	/*
 	 * Initialize barrier state. Since we're a brand-new process, there
 	 * shouldn't be any leftover backend-private state that needs to be
@@ -192,9 +202,6 @@ ProcSignalInit(int pss_idx)
 	pg_atomic_write_u64(&slot->pss_barrierGeneration, barrier_generation);
 	pg_memory_barrier();
 
-	/* Mark slot with my PID */
-	slot->pss_pid = MyProcPid;
-
 	/* Remember slot location for CheckProcSignal */
 	MyProcSignalSlot = slot;
 
-- 
2.54.0

From 099e5a2b1bf0c8631f6b5f2a4bfba4ee039b5d5b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Tue, 28 Apr 2026 12:21:21 -0700
Subject: [PATCH v2_17] Fix race between ProcSignalInit() and
 EmitProcSignalBarrier().

Previously, ProcSignalInit() read the global barrier generation before
publishing its PID into the pss_pid slot. This created a race
condition: a process could initialize its local generation with an
older global value, while a concurrent EmitProcSignalBarrier() might
skip that process because its pss_pid was still zero. This resulted in
WaitForProcSignalBarrier() hanging indefinitely.

This commit fixes the issue by publishing pss_pid before reading
psh_barrierGeneration, with a memory barrier in between so that the
store is globally visible first. A concurrent EmitProcSignalBarrier()
then either observes the published PID and signals this slot, or
completes its generation increment before we load it.

While this race has become more visible due to recent features using
signal barriers in more places (such as online wal_level changes), the
issue has theoretically been present since signal barriers were introduced
to release smgr caches (e.g., in DROP DATABASE). So backpatch to 15.

This issue was also reported by buildfarm animal flaviventris.

Reported-by: Melanie Plageman <[email protected]>
Reviewed-by: Alexander Lakhin <[email protected]>
Reviewed-by: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/caeze2wgajmwredn7chtba8er2ybvkcoa0kvn25-1evntrhs...@mail.gmail.com
Backpatch-through: 15
---
 src/backend/storage/ipc/procsignal.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index d6857f5a8bb..86f39e42ad6 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -175,6 +175,16 @@ ProcSignalInit(void)
 	/* Clear out any leftover signal reasons */
 	MemSet(slot->pss_signalFlags, 0, NUM_PROCSIGNALS * sizeof(sig_atomic_t));
 
+	/*
+	 * Publish the PID before reading the global barrier generation to ensure
+	 * that EmitProcSignalBarrier() doesn't skip us while we are grabbing an
+	 * older generation. We need a memory barrier here to make sure that the
+	 * update of pss_pid is globally visible before the load of the global
+	 * barrier generation executes.
+	 */
+	slot->pss_pid = MyProcPid;
+	pg_memory_barrier();
+
 	/*
 	 * Initialize barrier state. Since we're a brand-new process, there
 	 * shouldn't be any leftover backend-private state that needs to be
@@ -192,9 +202,6 @@ ProcSignalInit(void)
 	pg_atomic_write_u64(&slot->pss_barrierGeneration, barrier_generation);
 	pg_memory_barrier();
 
-	/* Mark slot with my PID */
-	slot->pss_pid = MyProcPid;
-
 	/* Remember slot location for CheckProcSignal */
 	MyProcSignalSlot = slot;
 
-- 
2.54.0

From cbac8a3b949a893f530150a1da212bc67a46af00 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Tue, 28 Apr 2026 12:21:21 -0700
Subject: [PATCH v2_18] Fix race between ProcSignalInit() and
 EmitProcSignalBarrier().

Previously, ProcSignalInit() read the global barrier generation before
publishing its PID into the pss_pid slot. This created a race
condition: a process could initialize its local generation with an
older global value, while a concurrent EmitProcSignalBarrier() might
skip that process because its pss_pid was still zero. This resulted in
WaitForProcSignalBarrier() hanging indefinitely.

This commit fixes the issue by publishing pss_pid before reading
psh_barrierGeneration, with a memory barrier in between so that the
store is globally visible first. A concurrent EmitProcSignalBarrier()
then either observes the published PID and signals this slot, or
completes its generation increment before we load it.

While this race has become more visible due to recent features using
signal barriers in more places (such as online wal_level changes), the
issue has theoretically been present since signal barriers were introduced
to release smgr caches (e.g., in DROP DATABASE). So backpatch to 15.

This issue was also reported by buildfarm animal flaviventris.

Reported-by: Melanie Plageman <[email protected]>
Reviewed-by: Alexander Lakhin <[email protected]>
Reviewed-by: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/caeze2wgajmwredn7chtba8er2ybvkcoa0kvn25-1evntrhs...@mail.gmail.com
Backpatch-through: 15
---
 src/backend/storage/ipc/procsignal.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 05d99b452c3..a0117ef969b 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -185,6 +185,16 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
 	/* Clear out any leftover signal reasons */
 	MemSet(slot->pss_signalFlags, 0, NUM_PROCSIGNALS * sizeof(sig_atomic_t));
 
+	/*
+	 * Publish the PID before reading the global barrier generation to ensure
+	 * that EmitProcSignalBarrier() doesn't skip us while we are grabbing an
+	 * older generation. We need a memory barrier here to make sure that the
+	 * update of pss_pid is globally visible before the load of the global
+	 * barrier generation executes.
+	 */
+	pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
+	pg_memory_barrier();
+
 	/*
 	 * Initialize barrier state. Since we're a brand-new process, there
 	 * shouldn't be any leftover backend-private state that needs to be
@@ -204,7 +214,6 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
 	if (cancel_key_len > 0)
 		memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
 	slot->pss_cancel_key_len = cancel_key_len;
-	pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
 
 	SpinLockRelease(&slot->pss_mutex);
 
-- 
2.54.0

From a4c69b7ef9eacb79581dd2622ad8e107089b0dd2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Tue, 28 Apr 2026 12:21:21 -0700
Subject: [PATCH v2_16] Fix race between ProcSignalInit() and
 EmitProcSignalBarrier().

Previously, ProcSignalInit() read the global barrier generation before
publishing its PID into the pss_pid slot. This created a race
condition: a process could initialize its local generation with an
older global value, while a concurrent EmitProcSignalBarrier() might
skip that process because its pss_pid was still zero. This resulted in
WaitForProcSignalBarrier() hanging indefinitely.

This commit fixes the issue by publishing pss_pid before reading
psh_barrierGeneration, with a memory barrier in between so that the
store is globally visible first. A concurrent EmitProcSignalBarrier()
then either observes the published PID and signals this slot, or
completes its generation increment before we load it.

While this race has become more visible due to recent features using
signal barriers in more places (such as online wal_level changes), the
issue has theoretically been present since signal barriers were introduced
to release smgr caches (e.g., in DROP DATABASE). So backpatch to 15.

This issue was also reported by buildfarm animal flaviventris.

Reported-by: Melanie Plageman <[email protected]>
Reviewed-by: Alexander Lakhin <[email protected]>
Reviewed-by: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/caeze2wgajmwredn7chtba8er2ybvkcoa0kvn25-1evntrhs...@mail.gmail.com
Backpatch-through: 15
---
 src/backend/storage/ipc/procsignal.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c85cb5cc18d..01186ab91fb 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -176,6 +176,16 @@ ProcSignalInit(int pss_idx)
 	/* Clear out any leftover signal reasons */
 	MemSet(slot->pss_signalFlags, 0, NUM_PROCSIGNALS * sizeof(sig_atomic_t));
 
+	/*
+	 * Publish the PID before reading the global barrier generation to ensure
+	 * that EmitProcSignalBarrier() doesn't skip us while we are grabbing an
+	 * older generation. We need a memory barrier here to make sure that the
+	 * update of pss_pid is globally visible before the load of the global
+	 * barrier generation executes.
+	 */
+	slot->pss_pid = MyProcPid;
+	pg_memory_barrier();
+
 	/*
 	 * Initialize barrier state. Since we're a brand-new process, there
 	 * shouldn't be any leftover backend-private state that needs to be
@@ -193,9 +203,6 @@ ProcSignalInit(int pss_idx)
 	pg_atomic_write_u64(&slot->pss_barrierGeneration, barrier_generation);
 	pg_memory_barrier();
 
-	/* Mark slot with my PID */
-	slot->pss_pid = MyProcPid;
-
 	/* Remember slot location for CheckProcSignal */
 	MyProcSignalSlot = slot;
 
-- 
2.54.0

From c8de0ff6283b620d3f81957c3a1947f3c024bd68 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Tue, 28 Apr 2026 12:21:21 -0700
Subject: [PATCH v2_master] Fix race between ProcSignalInit() and
 EmitProcSignalBarrier().

Previously, ProcSignalInit() read the global barrier generation before
publishing its PID into the pss_pid slot. This created a race
condition: a process could initialize its local generation with an
older global value, while a concurrent EmitProcSignalBarrier() might
skip that process because its pss_pid was still zero. This resulted in
WaitForProcSignalBarrier() hanging indefinitely.

This commit fixes the issue by publishing pss_pid before reading
psh_barrierGeneration, with a memory barrier in between so that the
store is globally visible first. A concurrent EmitProcSignalBarrier()
then either observes the published PID and signals this slot, or
completes its generation increment before we load it.

While this race has become more visible due to recent features using
signal barriers in more places (such as online wal_level changes), the
issue has theoretically been present since signal barriers were introduced
to release smgr caches (e.g., in DROP DATABASE). So backpatch to 15.

This issue was also reported by buildfarm animal flaviventris.

Reported-by: Melanie Plageman <[email protected]>
Reviewed-by: Alexander Lakhin <[email protected]>
Reviewed-by: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/caeze2wgajmwredn7chtba8er2ybvkcoa0kvn25-1evntrhs...@mail.gmail.com
Backpatch-through: 15
---
 src/backend/storage/ipc/procsignal.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 264e4c22ca6..b0681ca0ae2 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -188,6 +188,16 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
 	/* Clear out any leftover signal reasons */
 	MemSet(slot->pss_signalFlags, 0, NUM_PROCSIGNALS * sizeof(sig_atomic_t));
 
+	/*
+	 * Publish the PID before reading the global barrier generation to ensure
+	 * that EmitProcSignalBarrier() doesn't skip us while we are grabbing an
+	 * older generation. We need a memory barrier here to make sure that the
+	 * update of pss_pid is globally visible before the load of the global
+	 * barrier generation executes.
+	 */
+	pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
+	pg_memory_barrier();
+
 	/*
 	 * Initialize barrier state. Since we're a brand-new process, there
 	 * shouldn't be any leftover backend-private state that needs to be
@@ -207,7 +217,6 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
 	if (cancel_key_len > 0)
 		memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
 	slot->pss_cancel_key_len = cancel_key_len;
-	pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
 
 	SpinLockRelease(&slot->pss_mutex);
 
-- 
2.54.0
