Hello, In tablespace.c, a comment explains that DROP TABLESPACE can fail bogusly because of Windows file semantics:
* XXX On Windows, an unlinked file persists in the directory listing * until no process retains an open handle for the file. The DDL * commands that schedule files for unlink send invalidation messages * directing other PostgreSQL processes to close the files. DROP * TABLESPACE should not give up on the tablespace becoming empty * until all relevant invalidation processing is complete. While trying to get the AIO patchset working on more operating systems, this turned out to be a problem. Andres mentioned the new ProcSignalBarrier stuff as a good way to tackle this, so I tried it and it seems to work well so far. The idea in this initial version is to tell every process in the cluster to close all fds, and then try again. That's a pretty large hammer, but it isn't reached on Unix, and with slightly more work it could be made to happen only after 2 failures on Windows. It was tempting to try to figure out how to use the sinval mechanism to close precisely the right files instead, but it doesn't look safe to run sinval at arbitrary CFI() points. It's easier to see that the pre-existing closeAllVfds() function has an effect that is local to fd.c and doesn't affect the VFDs or SMgrRelations, so any CFI() should be an OK time to run that. While reading the ProcSignalBarrier code, I couldn't resist replacing its poll/sleep loop with condition variables. Thoughts?
From bf32e7426cf780d33ae1140c443ae8964397a3c9 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Sat, 30 Jan 2021 14:02:32 +1300 Subject: [PATCH 1/2] Use a global barrier to fix DROP TABLESPACE on Windows. Previously, it was possible for DROP TABLESPACE to fail because a table had been dropped, but other backends still had open file handles. Windows won't allow a directory containing unlinked-but-still-open files to be unlinked. Tackle this problem by forcing all backends to close all fds. No change for Unix systems, which didn't suffer from the problem, but simulate the need to do it in assertion builds just to exercise the code. XXX maybe only do this after 2nd failure? --- src/backend/commands/tablespace.c | 22 +++++++++++++++------- src/backend/storage/ipc/procsignal.c | 28 ++++++++++++---------------- src/backend/storage/smgr/md.c | 6 ++++++ src/backend/storage/smgr/smgr.c | 6 ++++++ src/include/storage/md.h | 1 + src/include/storage/procsignal.h | 7 +------ src/include/storage/smgr.h | 1 + 7 files changed, 42 insertions(+), 29 deletions(-) diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c index 69ea155d50..8853f1b658 100644 --- a/src/backend/commands/tablespace.c +++ b/src/backend/commands/tablespace.c @@ -520,15 +520,23 @@ DropTableSpace(DropTableSpaceStmt *stmt) * but we can't tell them apart from important data files that we * mustn't delete. So instead, we force a checkpoint which will clean * out any lingering files, and try again. - * - * XXX On Windows, an unlinked file persists in the directory listing - * until no process retains an open handle for the file. The DDL - * commands that schedule files for unlink send invalidation messages - * directing other PostgreSQL processes to close the files. DROP - * TABLESPACE should not give up on the tablespace becoming empty - * until all relevant invalidation processing is complete. */ RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT); + /* + * On Windows, an unlinked file persists in the directory listing until + * no process retains an open handle for the file. The DDL + * commands that schedule files for unlink send invalidation messages + * directing other PostgreSQL processes to close the files, but nothing + * guarantees they'll be processed in time. So, we'll also use a + * global barrier to ask all backends to close all files, and wait + * until they're finished. + */ +#if defined(WIN32) || defined(USE_ASSERT_CHECKING) + LWLockRelease(TablespaceCreateLock); + WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE)); + LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE); +#endif + /* And now try again. */ if (!destroy_tablespace_directories(tablespaceoid, false)) { /* Still not empty, the files must be important then */ diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c index c43cdd685b..c7cce5967d 100644 --- a/src/backend/storage/ipc/procsignal.c +++ b/src/backend/storage/ipc/procsignal.c @@ -27,6 +27,7 @@ #include "storage/latch.h" #include "storage/proc.h" #include "storage/shmem.h" +#include "storage/smgr.h" #include "storage/sinval.h" #include "tcop/tcopprot.h" @@ -98,7 +99,7 @@ static volatile ProcSignalSlot *MyProcSignalSlot = NULL; static bool CheckProcSignal(ProcSignalReason reason); static void CleanupProcSignalState(int status, Datum arg); static void ResetProcSignalBarrierBits(uint32 flags); -static bool ProcessBarrierPlaceholder(void); +static bool ProcessBarrierSmgrRelease(void); /* * ProcSignalShmemSize @@ -413,6 +414,12 @@ WaitForProcSignalBarrier(uint64 generation) CHECK_FOR_INTERRUPTS(); + /* + * Since interrupts may be held, also process barrier requests. + * This avoids (self) deadlocks. + */ + ProcessProcSignalBarrier(); + events = WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, @@ -538,8 +545,8 @@ ProcessProcSignalBarrier(void) type = (ProcSignalBarrierType) pg_rightmost_one_pos32(flags); switch (type) { - case PROCSIGNAL_BARRIER_PLACEHOLDER: - processed = ProcessBarrierPlaceholder(); + case PROCSIGNAL_BARRIER_SMGRRELEASE: + processed = ProcessBarrierSmgrRelease(); break; } @@ -605,20 +612,9 @@ ResetProcSignalBarrierBits(uint32 flags) } static bool -ProcessBarrierPlaceholder(void) +ProcessBarrierSmgrRelease(void) { - /* - * XXX. This is just a placeholder until the first real user of this - * machinery gets committed. Rename PROCSIGNAL_BARRIER_PLACEHOLDER to - * PROCSIGNAL_BARRIER_SOMETHING_ELSE where SOMETHING_ELSE is something - * appropriately descriptive. Get rid of this function and instead have - * ProcessBarrierSomethingElse. Most likely, that function should live in - * the file pertaining to that subsystem, rather than here. - * - * The return value should be 'true' if the barrier was successfully - * absorbed and 'false' if not. Note that returning 'false' can lead to - * very frequent retries, so try hard to make that an uncommon case. - */ + smgrrelease(); return true; } diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 0643d714fb..93a2aaabd9 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -548,6 +548,12 @@ mdclose(SMgrRelation reln, ForkNumber forknum) } } +void +mdrelease(void) +{ + closeAllVfds(); +} + /* * mdprefetch() -- Initiate asynchronous read of the specified block of a relation */ diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index 4dc24649df..0f8548747c 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -298,6 +298,12 @@ smgrcloseall(void) smgrclose(reln); } +void +smgrrelease(void) +{ + mdrelease(); +} + /* * smgrclosenode() -- Close SMgrRelation object for given RelFileNode, * if one exists. diff --git a/src/include/storage/md.h b/src/include/storage/md.h index 752b440864..7685367b97 100644 --- a/src/include/storage/md.h +++ b/src/include/storage/md.h @@ -23,6 +23,7 @@ extern void mdinit(void); extern void mdopen(SMgrRelation reln); extern void mdclose(SMgrRelation reln, ForkNumber forknum); +extern void mdrelease(void); extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern bool mdexists(SMgrRelation reln, ForkNumber forknum); extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo); diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h index 4ae7dc33b8..aaaa8ed07c 100644 --- a/src/include/storage/procsignal.h +++ b/src/include/storage/procsignal.h @@ -48,12 +48,7 @@ typedef enum typedef enum { - /* - * XXX. PROCSIGNAL_BARRIER_PLACEHOLDER should be replaced when the first - * real user of the ProcSignalBarrier mechanism is added. It's just here - * for now because we can't have an empty enum. - */ - PROCSIGNAL_BARRIER_PLACEHOLDER = 0 + PROCSIGNAL_BARRIER_SMGRRELEASE /* ask smgr to close files */ } ProcSignalBarrierType; /* diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index a6fbf7b6a6..1a21bd9e88 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -84,6 +84,7 @@ extern void smgrsetowner(SMgrRelation *owner, SMgrRelation reln); extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln); extern void smgrclose(SMgrRelation reln); extern void smgrcloseall(void); +extern void smgrrelease(void); extern void smgrclosenode(RelFileNodeBackend rnode); extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern void smgrdosyncall(SMgrRelation *rels, int nrels); -- 2.20.1
From 434723a3066c62f8dbdb238371b795bb56af624b Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Sat, 30 Jan 2021 16:32:49 +1300 Subject: [PATCH 2/2] Use condition variables for ProcSignalBarriers. Instead of a poll/sleep loop, use condition variables for precise wake-up. Adjust the condition variable sleep loop to support code that runs inside CFI() code that might cause the current cv_sleep_target to move under its feet without becoming corrupted. --- src/backend/storage/ipc/procsignal.c | 37 ++++++------------- src/backend/storage/lmgr/condition_variable.c | 11 +++++- 2 files changed, 21 insertions(+), 27 deletions(-) diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c index c7cce5967d..c9aed86af0 100644 --- a/src/backend/storage/ipc/procsignal.c +++ b/src/backend/storage/ipc/procsignal.c @@ -23,6 +23,7 @@ #include "miscadmin.h" #include "pgstat.h" #include "replication/walsender.h" +#include "storage/condition_variable.h" #include "storage/ipc.h" #include "storage/latch.h" #include "storage/proc.h" @@ -60,10 +61,11 @@ */ typedef struct { - pid_t pss_pid; - sig_atomic_t pss_signalFlags[NUM_PROCSIGNALS]; + volatile pid_t pss_pid; + volatile sig_atomic_t pss_signalFlags[NUM_PROCSIGNALS]; pg_atomic_uint64 pss_barrierGeneration; pg_atomic_uint32 pss_barrierCheckMask; + ConditionVariable pss_barrierCV; } ProcSignalSlot; /* @@ -94,7 +96,7 @@ typedef struct ((flags) &= ~(((uint32) 1) << (uint32) (type))) static ProcSignalHeader *ProcSignal = NULL; -static volatile ProcSignalSlot *MyProcSignalSlot = NULL; +static ProcSignalSlot *MyProcSignalSlot = NULL; static bool CheckProcSignal(ProcSignalReason reason); static void CleanupProcSignalState(int status, Datum arg); @@ -143,6 +145,7 @@ ProcSignalShmemInit(void) MemSet(slot->pss_signalFlags, 0, sizeof(slot->pss_signalFlags)); pg_atomic_init_u64(&slot->pss_barrierGeneration, PG_UINT64_MAX); pg_atomic_init_u32(&slot->pss_barrierCheckMask, 0); + ConditionVariableInit(&slot->pss_barrierCV); } } } @@ -157,7 +160,7 @@ ProcSignalShmemInit(void) void ProcSignalInit(int pss_idx) { - volatile ProcSignalSlot *slot; + ProcSignalSlot *slot; uint64 barrier_generation; Assert(pss_idx >= 1 && pss_idx <= NumProcSignalSlots); @@ -392,13 +395,11 @@ EmitProcSignalBarrier(ProcSignalBarrierType type) void WaitForProcSignalBarrier(uint64 generation) { - long timeout = 125L; - Assert(generation <= pg_atomic_read_u64(&ProcSignal->psh_barrierGeneration)); for (int i = NumProcSignalSlots - 1; i >= 0; i--) { - volatile ProcSignalSlot *slot = &ProcSignal->psh_slot[i]; + ProcSignalSlot *slot = &ProcSignal->psh_slot[i]; uint64 oldval; /* @@ -410,26 +411,11 @@ WaitForProcSignalBarrier(uint64 generation) oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration); while (oldval < generation) { - int events; - - CHECK_FOR_INTERRUPTS(); - - /* - * Since interrupts may be held, also process barrier requests. - * This avoids (self) deadlocks. - */ - ProcessProcSignalBarrier(); - - events = - WaitLatch(MyLatch, - WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, - timeout, WAIT_EVENT_PROC_SIGNAL_BARRIER); - ResetLatch(MyLatch); - + ConditionVariableSleep(&slot->pss_barrierCV, + WAIT_EVENT_PROC_SIGNAL_BARRIER); oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration); - if (events & WL_TIMEOUT) - timeout = Min(timeout * 2, 1000L); } + ConditionVariableCancelSleep(); } /* @@ -596,6 +582,7 @@ ProcessProcSignalBarrier(void) * next called. */ pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierGeneration, shared_gen); + ConditionVariableBroadcast(&MyProcSignalSlot->pss_barrierCV); } /* diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c index 0a61ff0031..3b2f4fd8b7 100644 --- a/src/backend/storage/lmgr/condition_variable.c +++ b/src/backend/storage/lmgr/condition_variable.c @@ -165,8 +165,6 @@ ConditionVariableTimedSleep(ConditionVariable *cv, long timeout, /* Reset latch before examining the state of the wait list. */ ResetLatch(MyLatch); - CHECK_FOR_INTERRUPTS(); - /* * If this process has been taken out of the wait list, then we know * that it has been signaled by ConditionVariableSignal (or @@ -190,6 +188,15 @@ ConditionVariableTimedSleep(ConditionVariable *cv, long timeout, } SpinLockRelease(&cv->mutex); + /* + * Check for interrupts, and return spuriously if that caused the + * current sleep target to change incidentally through using other CVs + * (notably ProcSignalBarrier). + */ + CHECK_FOR_INTERRUPTS(); + if (cv != cv_sleep_target) + done = true; + /* We were signaled, so return */ if (done) return false; -- 2.20.1