On Wed, Dec 30, 2020 at 4:13 AM 陈佳昕(步真) <buzhen....@alibaba-inc.com> wrote:
> I found some other problems; I'd like to share my changes with you for
> confirmation.
>
> <1> I changed the code in smgr_alloc_sr to avoid a deadlock.
>
>     LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
>     flags = smgr_lock_sr(sr);
>     Assert(flags & SR_SYNCING(forknum));
> +   flags &= ~SR_SYNCING(forknum);
>     if (flags & SR_JUST_DIRTIED(forknum))
>     {
>         /*
>          * Someone else dirtied it while we were syncing, so we can't mark
>          * it clean.  Let's give up on this SR and go around again.
>          */
>         smgr_unlock_sr(sr, flags);
>         LWLockRelease(mapping_lock);
>         goto retry;
>     }
>     /* This fork is clean! */
> -   flags &= ~SR_SYNCING(forknum);
>     flags &= ~SR_DIRTY(forknum);
> }
>
> In smgr_drop_sr, if the sr is SR_SYNCING, it retries until the sr is no
> longer SR_SYNCING.  But in smgr_alloc_sr, if the sr is SR_JUST_DIRTIED, it
> retries with another sr, and even though the fork has been synced by
> smgrimmedsync, the SR_SYNCING flag is never cleared.  That could leave
> smgr_drop_sr waiting forever, i.e. a deadlock.  So I changed your patch as
> above.
Right.  Thanks!  I also added smgrdropdb() to handle DROP DATABASE (the problem you reported in your previous email).  While tweaking that code, I fixed it so that it uses a condition variable to wait (instead of the silly sleep loop) when it needs to drop an SR that is being sync'd.  Also, it now releases the mapping lock while it's doing that, and reacquires it on retry.

> <2> I changed the code in smgr_drop_sr to avoid some corner cases:
>
>     /* Mark it invalid and drop the mapping. */
>     smgr_unlock_sr(sr, ~SR_VALID);
> +   for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
> +       sr->nblocks[forknum] = InvalidBlockNumber;
>     hash_search_with_hash_value(sr_mapping_table,
>                                 &reln->smgr_rnode,
>                                 hash,
>                                 HASH_REMOVE,
>                                 NULL);
>
> smgr_drop_sr just removes the hash entry from sr_mapping_table, but doesn't
> reset the corresponding entry in sr_pool.  smgrnblocks_fast, however, reads
> the sr straight from sr_pool, so I added the code above to keep some corner
> cases from seeing an unexpected result from smgrnblocks_fast.  Is it
> necessary?  I'd also like your advice on this.

Hmm.  I think it might be better to increment sr->generation.  That was already done in the "eviction" code path, where other processes might still have references to the SR object, and I don't think it's possible for anyone to access a dropped relation, but I suppose it's more consistent to do that anyway.  Fixed.  Thanks for the review!
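For anyone trying to follow the flag arithmetic in the discussion above, here is a tiny stand-alone sketch of the per-fork bit layout the patch uses.  It is only an illustration, not part of the patch: it assumes MAX_FORKNUM == 3 (main, fsm, vm, init), as in current PostgreSQL, and substitutes __builtin_ctz() for pg_rightmost_one_pos32().

/* flag_layout.c: stand-alone illustration of the SMgrSharedRelation flag bits */
#include <stdio.h>
#include <stdint.h>

#define MAX_FORKNUM 3            /* assumption: main, fsm, vm, init */

#define SR_LOCKED                0x01
#define SR_VALID                 0x02
#define SR_DIRTY(forknum)        (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 0))
#define SR_SYNCING(forknum)      (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 1))
#define SR_JUST_DIRTIED(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 2))
#define SR_DIRTY_MASK            ((SR_DIRTY(MAX_FORKNUM + 1) - 1) ^ (SR_DIRTY(0) - 1))
#define SR_SYNCING_MASK          ((SR_SYNCING(MAX_FORKNUM + 1) - 1) ^ (SR_SYNCING(0) - 1))

/* lowest dirty forknum (must have at least one); __builtin_ctz stands in for pg_rightmost_one_pos32 */
#define SR_GET_ONE_DIRTY(flags)  __builtin_ctz(((flags) >> 2) & (SR_DIRTY_MASK >> 2))

int
main(void)
{
    /* A valid entry whose FSM (forknum 1) and INIT (forknum 3) forks are dirty. */
    uint32_t flags = SR_VALID | SR_DIRTY(1) | SR_DIRTY(3);

    printf("flags            = 0x%04x\n", flags);
    printf("dirty mask       = 0x%04x\n", SR_DIRTY_MASK);          /* 0x003c */
    printf("syncing mask     = 0x%04x\n", SR_SYNCING_MASK);        /* 0x03c0 */
    printf("first dirty fork = %d\n", SR_GET_ONE_DIRTY(flags));    /* 1 */
    return 0;
}

With four forks, the lock bit, valid bit and the three per-fork groups fit in the low 14 bits of the 32-bit flags word.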
From 1d49405b41b3cdd6baf6f29219f18c6adefb5547 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Fri, 13 Nov 2020 14:38:41 +1300 Subject: [PATCH v3 1/2] WIP: Track relation sizes in shared memory. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Introduce a fixed size pool of SMgrSharedRelation objects. A new GUC smgr_shared_relation controls how many can exist at once, and they are evicted as required. "Dirty" SMgrSharedRelations can only be evicted after being synced to disk. Goals: 1. Provide faster lookups of relation sizes, cutting down on lseek() calls. This supercedes the recovery-only caching added recently, and replaces preexisting FSM and VM caching schemes. 2. Stop trusting the operating system to keep track of the size of files that we have recently extended, until fsync() has been called. XXX smgrimmedsync() is maybe too blunt an instrument? XXX perhaps mdsyncfiletag should tell smgr.c when it cleans forks via some new interface, but it doesn't actually know if it's cleaning a fork that extended a relation... XXX perhaps bgwriter should try to clean them? XXX currently reusing the array of locks also used for buffer mapping, need to define some more in lwlocks.c... Discussion: https://postgr.es/m/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com Reviewed-by: 陈佳昕(步真) <buzhen....@alibaba-inc.com> Reviewed-by: Andy Fan <zhihui.fan1...@gmail.com> --- contrib/pg_visibility/pg_visibility.c | 1 - src/backend/access/heap/visibilitymap.c | 28 +- src/backend/catalog/storage.c | 2 - src/backend/commands/dbcommands.c | 3 + src/backend/postmaster/pgstat.c | 3 + src/backend/storage/freespace/freespace.c | 35 +- src/backend/storage/ipc/ipci.c | 3 + src/backend/storage/smgr/md.c | 10 + src/backend/storage/smgr/smgr.c | 540 ++++++++++++++++++++-- src/backend/utils/misc/guc.c | 11 + src/include/pgstat.h | 1 + src/include/storage/smgr.h | 18 +- 12 files changed, 566 insertions(+), 89 deletions(-) diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c index 54e47b810f..702951e487 100644 --- a/contrib/pg_visibility/pg_visibility.c +++ b/contrib/pg_visibility/pg_visibility.c @@ -392,7 +392,6 @@ pg_truncate_visibility_map(PG_FUNCTION_ARGS) check_relation_relkind(rel); RelationOpenSmgr(rel); - rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber; block = visibilitymap_prepare_truncate(rel, 0); if (BlockNumberIsValid(block)) diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c index b1072183bc..8fe29ecae7 100644 --- a/src/backend/access/heap/visibilitymap.c +++ b/src/backend/access/heap/visibilitymap.c @@ -547,6 +547,7 @@ static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend) { Buffer buf; + BlockNumber nblocks; /* * We might not have opened the relation at the smgr level yet, or we @@ -557,20 +558,12 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend) */ RelationOpenSmgr(rel); - /* - * If we haven't cached the size of the visibility map fork yet, check it - * first. 
- */ - if (rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber) - { - if (smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM)) - smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM); - else - rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = 0; - } + if (!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM)) + smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, false); + nblocks = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM); /* Handle requests beyond EOF */ - if (blkno >= rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM]) + if (blkno >= nblocks) { if (extend) vm_extend(rel, blkno + 1); @@ -636,17 +629,10 @@ vm_extend(Relation rel, BlockNumber vm_nblocks) /* Might have to re-open if a cache flush happened */ RelationOpenSmgr(rel); - /* - * Create the file first if it doesn't exist. If smgr_vm_nblocks is - * positive then it must exist, no need for an smgrexists call. - */ - if ((rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == 0 || - rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber) && - !smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM)) + /* * Create the file first if it doesn't exist. */ + if (!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM)) smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, false); - /* Invalidate cache so that smgrnblocks() asks the kernel. */ - rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber; vm_nblocks_now = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM); /* Now extend the file */ diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index d538f25726..45f62030aa 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -290,8 +290,6 @@ RelationTruncate(Relation rel, BlockNumber nblocks) * Make sure smgr_targblock etc aren't pointing somewhere past new end */ rel->rd_smgr->smgr_targblock = InvalidBlockNumber; - for (int i = 0; i <= MAX_FORKNUM; ++i) - rel->rd_smgr->smgr_cached_nblocks[i] = InvalidBlockNumber; /* Prepare for truncation of MAIN fork of the relation */ forks[nforks] = MAIN_FORKNUM; diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c index f27c3fe8c1..e046334bdb 100644 --- a/src/backend/commands/dbcommands.c +++ b/src/backend/commands/dbcommands.c @@ -970,6 +970,9 @@ dropdb(const char *dbname, bool missing_ok, bool force) */ DropDatabaseBuffers(db_id); + /* Drop SMGR relations. */ + smgrdropdb(db_id); + /* * Tell the stats collector to forget it immediately, too. */ diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 123369f4fa..2905bfb472 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -4039,6 +4039,9 @@ pgstat_get_wait_ipc(WaitEventIPC w) case WAIT_EVENT_SAFE_SNAPSHOT: event_name = "SafeSnapshot"; break; + case WAIT_EVENT_SMGR_DROP_SYNC: + event_name = "SmgrDropSync"; + break; case WAIT_EVENT_SYNC_REP: event_name = "SyncRep"; break; diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c index 6a96126b0c..629923c4d4 100644 --- a/src/backend/storage/freespace/freespace.c +++ b/src/backend/storage/freespace/freespace.c @@ -531,29 +531,18 @@ static Buffer fsm_readbuf(Relation rel, FSMAddress addr, bool extend) { BlockNumber blkno = fsm_logical_to_physical(addr); + BlockNumber nblocks; Buffer buf; RelationOpenSmgr(rel); - /* - * If we haven't cached the size of the FSM yet, check it first. 
Also - * recheck if the requested block seems to be past end, since our cached - * value might be stale. (We send smgr inval messages on truncation, but - * not on extension.) - */ - if (rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber || - blkno >= rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM]) - { - /* Invalidate the cache so smgrnblocks asks the kernel. */ - rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber; - if (smgrexists(rel->rd_smgr, FSM_FORKNUM)) - smgrnblocks(rel->rd_smgr, FSM_FORKNUM); - else - rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = 0; - } + if (smgrexists(rel->rd_smgr, FSM_FORKNUM)) + nblocks = smgrnblocks(rel->rd_smgr, FSM_FORKNUM); + else + nblocks = 0; /* Handle requests beyond EOF */ - if (blkno >= rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM]) + if (blkno >= nblocks) { if (extend) fsm_extend(rel, blkno + 1); @@ -621,18 +610,10 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks) /* Might have to re-open if a cache flush happened */ RelationOpenSmgr(rel); - /* - * Create the FSM file first if it doesn't exist. If - * smgr_cached_nblocks[FSM_FORKNUM] is positive then it must exist, no - * need for an smgrexists call. - */ - if ((rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == 0 || - rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber) && - !smgrexists(rel->rd_smgr, FSM_FORKNUM)) + /* Create the FSM file first if it doesn't exist. */ + if (!smgrexists(rel->rd_smgr, FSM_FORKNUM)) smgrcreate(rel->rd_smgr, FSM_FORKNUM, false); - /* Invalidate cache so that smgrnblocks() asks the kernel. */ - rel->rd_smgr->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber; fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM); while (fsm_nblocks_now < fsm_nblocks) diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c index 96c2aaabbd..cfc55d3691 100644 --- a/src/backend/storage/ipc/ipci.c +++ b/src/backend/storage/ipc/ipci.c @@ -121,6 +121,7 @@ CreateSharedMemoryAndSemaphores(void) size = add_size(size, hash_estimate_size(SHMEM_INDEX_SIZE, sizeof(ShmemIndexEnt))); size = add_size(size, dsm_estimate_size()); + size = add_size(size, smgr_shmem_size()); size = add_size(size, BufferShmemSize()); size = add_size(size, LockShmemSize()); size = add_size(size, PredicateLockShmemSize()); @@ -212,6 +213,8 @@ CreateSharedMemoryAndSemaphores(void) dsm_shmem_init(); + smgr_shmem_init(); + /* * Set up xlog, clog, and buffers */ diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 9889ad6ad8..7310a12b6a 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -1359,6 +1359,16 @@ mdsyncfiletag(const FileTag *ftag, char *path) need_to_close = true; } + /* + * XXX: We could have an interface smgrbeginclean() that would return true + * if it has managed to set SR_SYNC and clean SR_JUST_DIRTIED, and then if + * so, after our sync runs we could call smgrfinishclean() with our + * success/failure report, which would clear SR_DIRTY if SR_JUST_DIRTIED + * hasn't been set in the meantime. But... how can we know if *this* + * segment is one that represents an extension? SR_DIRTY is just + * interested in syncing extended blocks. + */ + /* Sync the file. 
*/ result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC); save_errno = errno; diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index 072bdd118f..1da2dfc250 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -19,13 +19,68 @@ #include "access/xlog.h" #include "lib/ilist.h" +#include "pgstat.h" +#include "port/atomics.h" +#include "port/pg_bitutils.h" #include "storage/bufmgr.h" +#include "storage/condition_variable.h" +#include "storage/shmem.h" #include "storage/ipc.h" +#include "storage/lwlock.h" #include "storage/md.h" #include "storage/smgr.h" #include "utils/hsearch.h" #include "utils/inval.h" +/* + * An entry in the hash table that allows us to look up objects in the + * SMgrSharedRelation pool by rnode (+ backend). + */ +typedef struct SMgrSharedRelationMapping +{ + RelFileNodeBackend rnode; + int index; +} SMgrSharedRelationMapping; + +/* + * An object in shared memory tracks the size of the forks of a relation. + */ +struct SMgrSharedRelation +{ + RelFileNodeBackend rnode; + BlockNumber nblocks[MAX_FORKNUM + 1]; + pg_atomic_uint32 flags; +}; + +/* For now, we borrow the buffer managers array of locks. XXX fixme */ +#define SR_PARTITIONS NUM_BUFFER_PARTITIONS +#define SR_PARTITION_LOCK(hash) (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET].lock) + +/* Flags. */ +#define SR_LOCKED 0x01 +#define SR_VALID 0x02 + +/* Each forknum gets its own dirty, syncing and just dirtied bits. */ +#define SR_DIRTY(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 0)) +#define SR_SYNCING(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 1)) +#define SR_JUST_DIRTIED(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 2)) + +/* Masks to test if any forknum is currently dirty or syncing. */ +#define SR_SYNCING_MASK (((SR_SYNCING(MAX_FORKNUM + 1) - 1) ^ (SR_SYNCING(0) - 1))) +#define SR_DIRTY_MASK (((SR_DIRTY(MAX_FORKNUM + 1) - 1) ^ (SR_DIRTY(0) - 1))) + +/* Extract the lowest dirty forknum from flags (there must be at least one). */ +#define SR_GET_ONE_DIRTY(mask) pg_rightmost_one_pos32((((mask) >> 2) & (SR_DIRTY_MASK >> 2))) + +typedef struct SMgrSharedRelationPool +{ + ConditionVariable sync_flags_cleared; + pg_atomic_uint32 next; + SMgrSharedRelation objects[FLEXIBLE_ARRAY_MEMBER]; +} SMgrSharedRelationPool; + +static SMgrSharedRelationPool *sr_pool; +static HTAB *sr_mapping_table; /* * This struct of function pointers defines the API between smgr.c and @@ -98,6 +153,417 @@ static dlist_head unowned_relns; /* local function prototypes */ static void smgrshutdown(int code, Datum arg); +/* GUCs. */ +int smgr_shared_relations = 1000; + +/* + * Try to get the size of a relation's fork by looking it up in the mapping + * table with a shared lock. This will succeed if the SMgrRelation already + * exists. + */ +static BlockNumber +smgrnblocks_shared(SMgrRelation reln, ForkNumber forknum) +{ + SMgrSharedRelationMapping *mapping; + SMgrSharedRelation *sr; + uint32 hash; + LWLock *mapping_lock; + BlockNumber result = InvalidBlockNumber; + + hash = get_hash_value(sr_mapping_table, &reln->smgr_rnode); + mapping_lock = SR_PARTITION_LOCK(hash); + + LWLockAcquire(mapping_lock, LW_SHARED); + mapping = hash_search_with_hash_value(sr_mapping_table, + &reln->smgr_rnode, + hash, + HASH_FIND, + NULL); + if (mapping) + { + sr = &sr_pool->objects[mapping->index]; + result = sr->nblocks[forknum]; + } + LWLockRelease(mapping_lock); + + return result; +} + +/* + * Lock a SMgrSharedRelation. The lock is a spinlock that should be held for + * only a few instructions. 
The return value is the current set of flags, + * which may be modified and then passed to smgr_unlock_sr() to be atomically + * when the lock is released. + */ +static uint32 +smgr_lock_sr(SMgrSharedRelation *sr) +{ + + for (;;) + { + uint32 old_flags = pg_atomic_read_u32(&sr->flags); + uint32 flags; + + if (!(old_flags & SR_LOCKED)) + { + flags = old_flags | SR_LOCKED; + if (pg_atomic_compare_exchange_u32(&sr->flags, &old_flags, flags)) + return flags; + } + } + return 0; /* unreachable */ +} + +/* + * Unlock a SMgrSharedRelation, atomically updating its flags at the same + * time. + */ +static void +smgr_unlock_sr(SMgrSharedRelation *sr, uint32 flags) +{ + pg_write_barrier(); + pg_atomic_write_u32(&sr->flags, flags & ~SR_LOCKED); +} + +/* + * Allocate a new invalid SMgrSharedRelation, and return it locked. + * + * The replacement algorithm is a simple FIFO design with no second chance for + * now. + */ +static SMgrSharedRelation * +smgr_alloc_sr(void) +{ + SMgrSharedRelationMapping *mapping; + SMgrSharedRelation *sr; + uint32 index; + LWLock *mapping_lock; + uint32 flags; + RelFileNodeBackend rnode; + uint32 hash; + + retry: + /* Lock the next one in clock-hand order. */ + index = pg_atomic_fetch_add_u32(&sr_pool->next, 1) % smgr_shared_relations; + sr = &sr_pool->objects[index]; + flags = smgr_lock_sr(sr); + + /* If it's unused, can return it, still locked, immediately. */ + if (!(flags & SR_VALID)) + return sr; + + /* + * Copy the rnode and unlock. We'll briefly acquire both mapping and SR + * locks, but we need to do it in that order, so we'll unlock the SR + * first. + */ + rnode = sr->rnode; + smgr_unlock_sr(sr, flags); + + hash = get_hash_value(sr_mapping_table, &rnode); + mapping_lock = SR_PARTITION_LOCK(hash); + + LWLockAcquire(mapping_lock, LW_EXCLUSIVE); + mapping = hash_search_with_hash_value(sr_mapping_table, + &rnode, + hash, + HASH_FIND, + NULL); + if (!mapping || mapping->index != index) + { + /* Too slow, it's gone or now points somewhere else. Go around. */ + LWLockRelease(mapping_lock); + goto retry; + } + + /* We will lock the SR for just a few instructions. */ + flags = smgr_lock_sr(sr); + Assert(flags & SR_VALID); + + /* + * If another backend is currently syncing any fork, we aren't allowed to + * evict it, and waiting for it would be pointless because that other + * backend already plans to allocate it. So go around. + */ + if (flags & SR_SYNCING_MASK) + { + smgr_unlock_sr(sr, flags); + LWLockRelease(mapping_lock); + goto retry; + } + + /* + * We will sync every fork that is dirty, and then we'll try to + * evict it. + */ + while (flags & SR_DIRTY_MASK) + { + SMgrRelation reln; + ForkNumber forknum = SR_GET_ONE_DIRTY(flags); + + /* Set the sync bit, clear the just-dirtied bit and unlock. */ + flags |= SR_SYNCING(forknum); + flags &= ~SR_JUST_DIRTIED(forknum); + smgr_unlock_sr(sr, flags); + LWLockRelease(mapping_lock); + + /* + * Perform the I/O, with no locks held. + * XXX It sucks that we fsync every segment, not just the ones that need it... + */ + reln = smgropen(rnode.node, rnode.backend); + smgrimmedsync(reln, forknum); + + /* + * Reacquire the locks. The object can't have been evicted, + * because we set a sync bit. + */ + LWLockAcquire(mapping_lock, LW_EXCLUSIVE); + flags = smgr_lock_sr(sr); + Assert(flags & SR_SYNCING(forknum)); + flags &= ~SR_SYNCING(forknum); + if (flags & SR_JUST_DIRTIED(forknum)) + { + /* + * Someone else dirtied it while we were syncing, so we can't mark + * it clean. Let's give up on this SR and go around again. 
+ */ + smgr_unlock_sr(sr, flags); + LWLockRelease(mapping_lock); + goto retry; + } + + /* This fork is clean! */ + flags &= ~SR_DIRTY(forknum); + } + + /* + * If we made it this far, there are no dirty forks, so we're now allowed + * to evict the SR from the pool and the mapping table. + */ + flags &= ~SR_VALID; + smgr_unlock_sr(sr, flags); + + /* + * If any callers to smgr_sr_drop() or smgr_sr_drop_db() had the misfortune + * to have to wait for us to finish syncing, we can now wake them up. + */ + ConditionVariableBroadcast(&sr_pool->sync_flags_cleared); + + /* Remove from the mapping table. */ + hash_search_with_hash_value(sr_mapping_table, + &rnode, + hash, + HASH_REMOVE, + NULL); + LWLockRelease(mapping_lock); + + /* + * XXX: We unlock while doing HASH_REMOVE on principle. Maybe it'd be OK + * to hold it now that the clock hand is far away and there is no way + * anyone can look up this SR through buffer mapping table. + */ + flags = smgr_lock_sr(sr); + if (flags & SR_VALID) + { + /* Oops, someone else got it. */ + smgr_unlock_sr(sr, flags); + goto retry; + } + + return sr; +} + +/* + * Set the number of blocks in a relation, in shared memory, and optionally + * also mark the relation as "dirty" (meaning the it must be fsync'd before it + * can be evicted). + */ +static void +smgrnblocks_update(SMgrRelation reln, + ForkNumber forknum, + BlockNumber nblocks, + bool mark_dirty) +{ + SMgrSharedRelationMapping *mapping; + SMgrSharedRelation *sr = NULL; + uint32 hash; + LWLock *mapping_lock; + uint32 flags; + + hash = get_hash_value(sr_mapping_table, &reln->smgr_rnode); + mapping_lock = SR_PARTITION_LOCK(hash); + + retry: + LWLockAcquire(mapping_lock, LW_SHARED); + mapping = hash_search_with_hash_value(sr_mapping_table, + &reln->smgr_rnode, + hash, + HASH_FIND, + NULL); + if (mapping) + { + sr = &sr_pool->objects[mapping->index]; + flags = smgr_lock_sr(sr); + if (mark_dirty) + { + /* + * Extend and truncate clobber the value, and there are no races + * to worry about because they can have higher level exclusive + * locking on the relation. + */ + sr->nblocks[forknum] = nblocks; + + /* + * Mark it dirty, and if it's currently being sync'd, make sure it + * stays dirty after that completes. + */ + flags |= SR_DIRTY(forknum); + if (flags & SR_SYNCING(forknum)) + flags |= SR_JUST_DIRTIED(forknum); + } + else if (!(flags & SR_DIRTY(forknum))) + { + /* + * We won't clobber a dirty value with a non-dirty update, to + * avoid races against concurrent extend/truncate, but we can + * install a new clean value. + */ + sr->nblocks[forknum] = nblocks; + } + smgr_unlock_sr(sr, flags); + } + LWLockRelease(mapping_lock); + + /* If we didn't find it, then we'll need to allocate one. */ + if (!sr) + { + bool found; + + sr = smgr_alloc_sr(); + + /* Upgrade to exclusive lock so we can create a mapping. */ + LWLockAcquire(mapping_lock, LW_EXCLUSIVE); + mapping = hash_search_with_hash_value(sr_mapping_table, + &reln->smgr_rnode, + hash, + HASH_ENTER, + &found); + if (!found) + { + /* Success! Initialize. */ + mapping->index = sr - sr_pool->objects; + smgr_unlock_sr(sr, SR_VALID); + sr->rnode = reln->smgr_rnode; + for (int i = 0; i <= MAX_FORKNUM; ++i) + sr->nblocks[i] = InvalidBlockNumber; + LWLockRelease(mapping_lock); + } + else + { + /* Someone beat us to it. Go around again. 
*/ + smgr_unlock_sr(sr, 0); /* = not valid */ + LWLockRelease(mapping_lock); + goto retry; + } + } +} + +static void +smgr_drop_sr(RelFileNodeBackend *rnode) +{ + SMgrSharedRelationMapping *mapping; + SMgrSharedRelation *sr; + uint32 hash; + LWLock *mapping_lock; + uint32 flags; + + hash = get_hash_value(sr_mapping_table, rnode); + mapping_lock = SR_PARTITION_LOCK(hash); + +retry: + LWLockAcquire(mapping_lock, LW_EXCLUSIVE); + mapping = hash_search_with_hash_value(sr_mapping_table, + rnode, + hash, + HASH_FIND, + NULL); + if (mapping) + { + sr = &sr_pool->objects[mapping->index]; + + flags = smgr_lock_sr(sr); + Assert(flags & SR_VALID); + + if (flags & SR_SYNCING_MASK) + { + /* + * Oops, someone's syncing one of its forks; nothing to do but + * wait until that's finished. + */ + smgr_unlock_sr(sr, flags); + LWLockRelease(mapping_lock); + ConditionVariableSleep(&sr_pool->sync_flags_cleared, + WAIT_EVENT_SMGR_DROP_SYNC); + goto retry; + } + ConditionVariableCancelSleep(); + + /* Mark it invalid and drop the mapping. */ + smgr_unlock_sr(sr, ~SR_VALID); + hash_search_with_hash_value(sr_mapping_table, + rnode, + hash, + HASH_REMOVE, + NULL); + } + LWLockRelease(mapping_lock); +} + +size_t +smgr_shmem_size(void) +{ + size_t size = 0; + + size = add_size(size, + sizeof(offsetof(SMgrSharedRelationPool, objects) + + sizeof(SMgrSharedRelation) * smgr_shared_relations)); + size = add_size(size, + hash_estimate_size(smgr_shared_relations, + sizeof(SMgrSharedRelationMapping))); + + return size; +} + +void +smgr_shmem_init(void) +{ + HASHCTL info; + bool found; + + info.keysize = sizeof(RelFileNodeBackend); + info.entrysize = sizeof(SMgrSharedRelationMapping); + info.num_partitions = SR_PARTITIONS; + sr_mapping_table = ShmemInitHash("SMgrSharedRelation Mapping Table", + smgr_shared_relations, + smgr_shared_relations, + &info, + HASH_ELEM | HASH_BLOBS | HASH_PARTITION); + + sr_pool = ShmemInitStruct("SMgrSharedRelation Pool", + offsetof(SMgrSharedRelationPool, objects) + + sizeof(SMgrSharedRelation) * smgr_shared_relations, + &found); + if (!found) + { + ConditionVariableInit(&sr_pool->sync_flags_cleared); + pg_atomic_init_u32(&sr_pool->next, 0); + for (uint32 i = 0; i < smgr_shared_relations; ++i) + { + pg_atomic_init_u32(&sr_pool->objects[i].flags, 0); + } + } +} /* * smgrinit(), smgrshutdown() -- Initialize or shut down storage @@ -174,8 +640,6 @@ smgropen(RelFileNode rnode, BackendId backend) /* hash_search already filled in the lookup key */ reln->smgr_owner = NULL; reln->smgr_targblock = InvalidBlockNumber; - for (int i = 0; i <= MAX_FORKNUM; ++i) - reln->smgr_cached_nblocks[i] = InvalidBlockNumber; reln->smgr_which = 0; /* we only have md.c at present */ /* implementation-specific initialization */ @@ -246,6 +710,9 @@ smgrclearowner(SMgrRelation *owner, SMgrRelation reln) bool smgrexists(SMgrRelation reln, ForkNumber forknum) { + if (smgrnblocks_shared(reln, forknum) != InvalidBlockNumber) + return true; + return smgrsw[reln->smgr_which].smgr_exists(reln, forknum); } @@ -429,6 +896,9 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo) for (i = 0; i < nrels; i++) CacheInvalidateSmgr(rnodes[i]); + for (i = 0; i < nrels; i++) + smgr_drop_sr(&rels[i]->smgr_rnode); + /* * Delete the physical file(s). 
* @@ -464,16 +934,7 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, { smgrsw[reln->smgr_which].smgr_extend(reln, forknum, blocknum, buffer, skipFsync); - - /* - * Normally we expect this to increase nblocks by one, but if the cached - * value isn't as expected, just invalidate it so the next call asks the - * kernel. - */ - if (reln->smgr_cached_nblocks[forknum] == blocknum) - reln->smgr_cached_nblocks[forknum] = blocknum + 1; - else - reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber; + smgrnblocks_update(reln, forknum, blocknum + 1, true); } /* @@ -549,16 +1010,16 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum) { BlockNumber result; - /* - * For now, we only use cached values in recovery due to lack of a shared - * invalidation mechanism for changes in file size. - */ - if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber) - return reln->smgr_cached_nblocks[forknum]; + /* Can we get the answer from shared memory with only a share lock? */ + result = smgrnblocks_shared(reln, forknum); + if (result != InvalidBlockNumber) + return result; + /* Ask the kernel. */ result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum); - reln->smgr_cached_nblocks[forknum] = result; + /* Update the value in shared memory for faster service next time. */ + smgrnblocks_update(reln, forknum, result, false); return result; } @@ -599,19 +1060,8 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nb /* Do the truncation */ for (i = 0; i < nforks; i++) { - /* Make the cached size is invalid if we encounter an error. */ - reln->smgr_cached_nblocks[forknum[i]] = InvalidBlockNumber; - smgrsw[reln->smgr_which].smgr_truncate(reln, forknum[i], nblocks[i]); - - /* - * We might as well update the local smgr_cached_nblocks values. The - * smgr cache inval message that this function sent will cause other - * backends to invalidate their copies of smgr_fsm_nblocks and - * smgr_vm_nblocks, and these ones too at the next command boundary. - * But these ensure they aren't outright wrong until then. - */ - reln->smgr_cached_nblocks[forknum[i]] = nblocks[i]; + smgrnblocks_update(reln, forknum[i], nblocks[i], true); } } @@ -644,6 +1094,34 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum) smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum); } +/* + * When a database is dropped, we have to find and throw away all its + * SMgrSharedRelation objects. + */ +void +smgrdropdb(Oid database) +{ + for (int i = 0; i < smgr_shared_relations; ++i) + { + SMgrSharedRelation *sr = &sr_pool->objects[i]; + RelFileNodeBackend rnode; + uint32 flags; + + /* Hold the spinlock only while we copy out the rnode of matches. */ + flags = smgr_lock_sr(sr); + if ((flags & SR_VALID) && sr->rnode.node.dbNode == database) + { + rnode = sr->rnode; + smgr_unlock_sr(sr, flags); + + /* Drop, if it's still valid. 
*/ + smgr_drop_sr(&rnode); + } + else + smgr_unlock_sr(sr, flags); + } +} + /* * AtEOXact_SMgr * diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 878fcc2236..b548b4fd96 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -82,6 +82,7 @@ #include "storage/pg_shmem.h" #include "storage/predicate.h" #include "storage/proc.h" +#include "storage/smgr.h" #include "storage/standby.h" #include "tcop/tcopprot.h" #include "tsearch/ts_cache.h" @@ -2435,6 +2436,16 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"smgr_shared_relations", PGC_POSTMASTER, RESOURCES_MEM, + gettext_noop("Sets the number of shared relation objects in memory at one time."), + NULL + }, + &smgr_shared_relations, + 1000, 64, INT_MAX, + NULL, NULL, NULL + }, + /* * See also CheckRequiredParameterValues() if this parameter changes */ diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 5954068dec..d13cc20279 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -963,6 +963,7 @@ typedef enum WAIT_EVENT_REPLICATION_ORIGIN_DROP, WAIT_EVENT_REPLICATION_SLOT_DROP, WAIT_EVENT_SAFE_SNAPSHOT, + WAIT_EVENT_SMGR_DROP_SYNC, WAIT_EVENT_SYNC_REP, WAIT_EVENT_XACT_GROUP_UPDATE } WaitEventIPC; diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index f28a842401..7031c70196 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -18,6 +18,13 @@ #include "storage/block.h" #include "storage/relfilenode.h" +/* GUCs. */ +extern int smgr_shared_relations; + +/* Definition private to smgr.c. */ +struct SMgrSharedRelation; +typedef struct SMgrSharedRelation SMgrSharedRelation; + /* * smgr.c maintains a table of SMgrRelation objects, which are essentially * cached file handles. An SMgrRelation is created (if not already present) @@ -44,14 +51,7 @@ typedef struct SMgrRelationData /* pointer to owning pointer, or NULL if none */ struct SMgrRelationData **smgr_owner; - /* - * The following fields are reset to InvalidBlockNumber upon a cache flush - * event, and hold the last known size for each fork. This information is - * currently only reliable during recovery, since there is no cache - * invalidation for fork extension. - */ BlockNumber smgr_targblock; /* current insertion target block */ - BlockNumber smgr_cached_nblocks[MAX_FORKNUM + 1]; /* last known size */ /* additional public fields may someday exist here */ @@ -77,6 +77,9 @@ typedef SMgrRelationData *SMgrRelation; #define SmgrIsTemp(smgr) \ RelFileNodeBackendIsTemp((smgr)->smgr_rnode) +extern size_t smgr_shmem_size(void); +extern void smgr_shmem_init(void); + extern void smgrinit(void); extern SMgrRelation smgropen(RelFileNode rnode, BackendId backend); extern bool smgrexists(SMgrRelation reln, ForkNumber forknum); @@ -102,6 +105,7 @@ extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum); extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, BlockNumber *nblocks); extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum); +extern void smgrdropdb(Oid database); extern void AtEOXact_SMgr(void); #endif /* SMGR_H */ -- 2.20.1
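An aside on the locking in 0001: the flags word doubles as a tiny spinlock, with SR_LOCKED taken and released via compare-and-swap.  Below is a stand-alone C11 approximation of what smgr_lock_sr()/smgr_unlock_sr() do; it is purely illustrative and not part of the patch, with <stdatomic.h> standing in for the pg_atomic_* API and the shared-memory/LWLock context omitted.

/* sr_lock.c: stand-alone approximation of the flag-word spinlock */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define SR_LOCKED 0x01
#define SR_VALID  0x02

typedef struct
{
    _Atomic uint32_t flags;
} DemoSR;                        /* hypothetical stand-in for SMgrSharedRelation */

/* Spin until we own SR_LOCKED; return the flags value as it was when locked. */
static uint32_t
demo_lock_sr(DemoSR *sr)
{
    for (;;)
    {
        uint32_t old_flags = atomic_load(&sr->flags);

        if (!(old_flags & SR_LOCKED))
        {
            uint32_t new_flags = old_flags | SR_LOCKED;

            if (atomic_compare_exchange_strong(&sr->flags, &old_flags, new_flags))
                return new_flags;
        }
        /* someone else holds it, or the CAS raced; try again */
    }
}

/* Publish updated flags and drop SR_LOCKED in a single atomic store. */
static void
demo_unlock_sr(DemoSR *sr, uint32_t flags)
{
    /* seq_cst store here; the patch uses pg_write_barrier() + a plain atomic write */
    atomic_store(&sr->flags, flags & ~SR_LOCKED);
}

int
main(void)
{
    DemoSR   sr = {0};
    uint32_t flags = demo_lock_sr(&sr);

    flags |= SR_VALID;           /* mutate the flags while holding the "lock" */
    demo_unlock_sr(&sr, flags);
    printf("final flags = 0x%02x\n", (unsigned) atomic_load(&sr.flags));    /* 0x02 */
    return 0;
}

The real functions rely on the critical section being only a few instructions long, which is why the loop spins without any backoff.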
From 2c3c98711bdadd0846c4dfff3aea6cdf84f6df44 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Thu, 19 Nov 2020 17:09:51 +1300 Subject: [PATCH v3 2/2] WIP: Provide a lock-free fast path for smgrnblocks(). MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit SMgrRelation objects gain a pointer to the last known SMgrSharedRelation object. There are three concurrency hazards to worry about: 1. The SMgrSharedRelation pointed to could be evicted at any time. We record a generation number, and insert memory barriers so that we can detect that and fall back to a slower path. 2. The nblocks value is read without any locking, which is atomic because it is a 32 bit value, and PostgreSQL requires atomic 32 bit reads generally. 3. The nblocks value must be fresh enough for scans, extension, truncatation and dropping buffers, because the those operations all executed memory barriers when it acquired a snapshot to scan (which doesn't need to see blocks added after that) or an exclusive heavyweight lock to extend, truncate or drop. XXX right? XXX That's the idea anyway, but this is just a sketch; almost certainly incomplet and inkorrect. Discussion: https://postgr.es/m/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com Reviewed-by: 陈佳昕(步真) <buzhen....@alibaba-inc.com> Reviewed-by: Andy Fan <zhihui.fan1...@gmail.com> --- src/backend/storage/smgr/smgr.c | 63 +++++++++++++++++++++++++++++++-- src/include/storage/smgr.h | 4 +++ 2 files changed, 65 insertions(+), 2 deletions(-) diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index 1da2dfc250..5c17e3e74d 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -50,6 +50,7 @@ struct SMgrSharedRelation RelFileNodeBackend rnode; BlockNumber nblocks[MAX_FORKNUM + 1]; pg_atomic_uint32 flags; + pg_atomic_uint64 generation; /* mapping change */ }; /* For now, we borrow the buffer managers array of locks. XXX fixme */ @@ -156,6 +157,40 @@ static void smgrshutdown(int code, Datum arg); /* GUCs. */ int smgr_shared_relations = 1000; +/* + * Try to get the size of a relation's fork without locking. + */ +static BlockNumber +smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum) +{ + SMgrSharedRelation *sr = reln->smgr_shared; + BlockNumber result; + + if (sr) + { + pg_read_barrier(); + + /* We can load int-sized values atomically without special measures. */ + Assert(sizeof(sr->nblocks[forknum]) == sizeof(uint32)); + result = sr->nblocks[forknum]; + + /* + * With a read barrier between the loads, we can check that the object + * still refers to the same rnode before trusting the answer. + */ + pg_read_barrier(); + if (pg_atomic_read_u64(&sr->generation) == reln->smgr_shared_generation) + return result; + + /* + * The generation doesn't match, the shared relation must have been + * evicted since we got a pointer to it. We'll need to do more work. + */ + } + + return InvalidBlockNumber; +} + /* * Try to get the size of a relation's fork by looking it up in the mapping * table with a shared lock. This will succeed if the SMgrRelation already @@ -183,6 +218,10 @@ smgrnblocks_shared(SMgrRelation reln, ForkNumber forknum) { sr = &sr_pool->objects[mapping->index]; result = sr->nblocks[forknum]; + + /* We can take the fast path until this SR is eventually evicted. 
*/ + reln->smgr_shared = sr; + reln->smgr_shared_generation = pg_atomic_read_u64(&sr->generation); } LWLockRelease(mapping_lock); @@ -339,9 +378,14 @@ smgr_alloc_sr(void) /* * If we made it this far, there are no dirty forks, so we're now allowed - * to evict the SR from the pool and the mapping table. + * to evict the SR from the pool and the mapping table. Make sure that + * smgrnblocks_fast() sees that its pointer is now invalid by bumping the + * generation. */ flags &= ~SR_VALID; + pg_atomic_write_u64(&sr->generation, + pg_atomic_read_u64(&sr->generation) + 1); + pg_write_barrier(); smgr_unlock_sr(sr, flags); /* @@ -455,6 +499,8 @@ smgrnblocks_update(SMgrRelation reln, mapping->index = sr - sr_pool->objects; smgr_unlock_sr(sr, SR_VALID); sr->rnode = reln->smgr_rnode; + pg_atomic_write_u64(&sr->generation, + pg_atomic_read_u64(&sr->generation) + 1); for (int i = 0; i <= MAX_FORKNUM; ++i) sr->nblocks[i] = InvalidBlockNumber; LWLockRelease(mapping_lock); @@ -509,6 +555,11 @@ retry: } ConditionVariableCancelSleep(); + /* Make sure smgrnblocks_fast() knows it's invalidated. */ + pg_atomic_write_u64(&sr->generation, + pg_atomic_read_u64(&sr->generation) + 1); + pg_write_barrier(); + /* Mark it invalid and drop the mapping. */ smgr_unlock_sr(sr, ~SR_VALID); hash_search_with_hash_value(sr_mapping_table, @@ -561,6 +612,7 @@ smgr_shmem_init(void) for (uint32 i = 0; i < smgr_shared_relations; ++i) { pg_atomic_init_u32(&sr_pool->objects[i].flags, 0); + pg_atomic_init_u64(&sr_pool->objects[i].generation, 0); } } } @@ -640,6 +692,8 @@ smgropen(RelFileNode rnode, BackendId backend) /* hash_search already filled in the lookup key */ reln->smgr_owner = NULL; reln->smgr_targblock = InvalidBlockNumber; + reln->smgr_shared = NULL; + reln->smgr_shared_generation = 0; reln->smgr_which = 0; /* we only have md.c at present */ /* implementation-specific initialization */ @@ -710,7 +764,7 @@ smgrclearowner(SMgrRelation *owner, SMgrRelation reln) bool smgrexists(SMgrRelation reln, ForkNumber forknum) { - if (smgrnblocks_shared(reln, forknum) != InvalidBlockNumber) + if (smgrnblocks_fast(reln, forknum) != InvalidBlockNumber) return true; return smgrsw[reln->smgr_which].smgr_exists(reln, forknum); @@ -1010,6 +1064,11 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum) { BlockNumber result; + /* Can we get the answer from shared memory without locking? */ + result = smgrnblocks_fast(reln, forknum); + if (result != InvalidBlockNumber) + return result; + /* Can we get the answer from shared memory with only a share lock? */ result = smgrnblocks_shared(reln, forknum); if (result != InvalidBlockNumber) diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index 7031c70196..fec13cb7bc 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -51,6 +51,10 @@ typedef struct SMgrRelationData /* pointer to owning pointer, or NULL if none */ struct SMgrRelationData **smgr_owner; + /* pointer to shared object, valid if non-NULL and generation matches */ + SMgrSharedRelation *smgr_shared; + uint64 smgr_shared_generation; + BlockNumber smgr_targblock; /* current insertion target block */ /* additional public fields may someday exist here */ -- 2.20.1
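And one more illustration, this time of the fast path added in 0002: readers load nblocks with no lock at all and then validate it against a generation counter that eviction and drop bump before a slot can be recycled.  This is a stand-alone C11 sketch of the idea, not the patch itself; it leans on the same assumption that aligned 32-bit loads are atomic, and uses atomic_thread_fence() where the patch uses pg_read_barrier().

/* gen_check.c: stand-alone sketch of the generation-validated fast path */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define InvalidBlockNumber ((uint32_t) 0xFFFFFFFF)

typedef struct
{
    _Atomic uint64_t generation; /* bumped whenever the slot is recycled */
    uint32_t nblocks;            /* 32 bits, so a plain aligned load is assumed atomic */
} DemoSharedRel;                 /* hypothetical stand-in for SMgrSharedRelation */

typedef struct
{
    DemoSharedRel *shared;           /* may be recycled underneath us at any time */
    uint64_t       shared_generation; /* generation observed when we cached the pointer */
} DemoLocalRel;                  /* hypothetical stand-in for SMgrRelationData */

/* Reader: trust the cached pointer only if the generation still matches. */
static uint32_t
demo_nblocks_fast(DemoLocalRel *rel)
{
    if (rel->shared)
    {
        uint32_t nblocks = rel->shared->nblocks;

        /* Re-check the generation *after* reading the value. */
        atomic_thread_fence(memory_order_acquire);
        if (atomic_load(&rel->shared->generation) == rel->shared_generation)
            return nblocks;
        /* Slot was recycled; the caller must take the slower, locked path. */
    }
    return InvalidBlockNumber;
}

/* Evictor: bump the generation before handing the slot to someone else. */
static void
demo_evict(DemoSharedRel *sr)
{
    atomic_fetch_add(&sr->generation, 1);
    /* only now may the slot be reused for another relation */
    sr->nblocks = InvalidBlockNumber;
}

int
main(void)
{
    DemoSharedRel sr = {.nblocks = 42};
    DemoLocalRel  rel = {.shared = &sr,
                         .shared_generation = atomic_load(&sr.generation)};

    printf("before eviction: %u\n", (unsigned) demo_nblocks_fast(&rel));    /* 42 */
    demo_evict(&sr);
    printf("after eviction:  %u\n", (unsigned) demo_nblocks_fast(&rel));    /* InvalidBlockNumber */
    return 0;
}

If the generation no longer matches, the caller falls back to smgrnblocks_shared() and, failing that, asks the kernel, exactly as the patched smgrnblocks() does.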