Hi, I realize that this might not be the absolutely newest version of the undo storage part of this patchset - but I'm trying to understand the whole context, and that's hard without reading through the whole stack in a situation where the layers actually fit together.
On 2019-07-29 01:48:30 -0700, Andres Freund wrote: > < more tomorrow > > + /* Move the high log number pointer past this one. */ > + ++UndoLogShared->next_logno; Fwiw, I find having "next" and "low" as variable names, and then describing "next" as high in comments somewhat confusing. > +/* check_hook: validate new undo_tablespaces */ > +bool > +check_undo_tablespaces(char **newval, void **extra, GucSource source) > +{ > + char *rawname; > + List *namelist; > + > + /* Need a modifiable copy of string */ > + rawname = pstrdup(*newval); > + > + /* > + * Parse string into list of identifiers, just to check for > + * well-formedness (unfortunateley we can't validate the names in the > + * catalog yet). > + */ > + if (!SplitIdentifierString(rawname, ',', &namelist)) > + { > + /* syntax error in name list */ > + GUC_check_errdetail("List syntax is invalid."); > + pfree(rawname); > + list_free(namelist); > + return false; > + } Why can't you validate the catalog here? In a lot of cases this will be called in a transaction, especially when changing it in a session. E.g. temp_tablespaces does so? > + /* > + * Make sure we aren't already in a transaction that has been assigned > an > + * XID. This ensures we don't detach from an undo log that we might > have > + * started writing undo data into for this transaction. > + */ > + if (GetTopTransactionIdIfAny() != InvalidTransactionId) > + ereport(ERROR, > + > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + (errmsg("undo_tablespaces cannot be changed > while a transaction is in progress")))); Hm. Is this really a great proxy? Seems like it'll block changing the tablespace unnecessarily in a lot of situations, and like there could even be holes in the future - it doesn't seem crazy that we'd want to emit undo without assigning an xid in some situations (e.g. for deleting files in error cases, or for more aggressive cleanup of dead index entries during reads or such). 
It seems like it'd be pretty easy to just check CurrentSession->attached_undo_slots[i].slot->meta.unlogged.this_xact_start or such? > +static bool > +choose_undo_tablespace(bool force_detach, Oid *tablespace) > +{ > + else > + { > + /* > + * Choose an OID using our pid, so that if several backends > have the > + * same multi-tablespace setting they'll spread out. We could > easily > + * do better than this if more serious load balancing is judged > + * useful. > + */ We're not really choosing an oid, we're choosing a tablespace. Obviously one can understand it as is, but it confused me for a second. > + int index = MyProcPid % length; Hm. Is MyProcPid a good proxy here? Wouldn't it be better to use MyProc->pgprocno or such? That's much more guaranteed to space out somewhat evenly? > + int first_index = index; > + Oid oid = InvalidOid; > + > + /* > + * Take the tablespace create/drop lock while we look the name > up. > + * This prevents the tablespace from being dropped while we're > trying > + * to resolve the name, or while the called is trying to create > an > + * undo log in it. The caller will have to release this lock. > + */ > + LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE); Why exclusive? I think any function that acquires a lock it doesn't release (or the reverse) ought to have a big honking comment in its header warning of that. And an explanation as to why that is. > + for (;;) > + { > + const char *name = list_nth(namelist, index); > + > + oid = get_tablespace_oid(name, true); > + if (oid == InvalidOid) > + { > + /* Unknown tablespace, try the next one. */ > + index = (index + 1) % length; > + /* > + * But if we've tried them all, it's time to > complain. We'll > + * arbitrarily complain about the last one we > tried in the > + * error message. 
> + */ > + if (index == first_index) > + ereport(ERROR, > + > (errcode(ERRCODE_UNDEFINED_OBJECT), > + errmsg("tablespace > \"%s\" does not exist", name), > + errhint("Create the > tablespace or set undo_tablespaces to a valid or empty list."))); > + continue; Wouldn't it be better to simply include undo_tablespaces in the error messages? Something roughly like 'none of the tablespaces in undo_tablespaces = \"%s\" exists"? > + /* > + * If we came here because the user changed undo_tablesaces, then detach > + * from any undo logs we happen to be attached to. > + */ > + if (force_detach) > + { > + for (i = 0; i < UndoPersistenceLevels; ++i) > + { > + UndoLogSlot *slot = > CurrentSession->attached_undo_slots[i]; > + > + if (slot != NULL) > + { > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > + slot->pid = InvalidPid; > + slot->meta.unlogged.xid = InvalidTransactionId; > + LWLockRelease(&slot->mutex); Would it make sense to re-assert here that the current transaction didn't write undo? > +bool > +DropUndoLogsInTablespace(Oid tablespace) > +{ > + DIR *dir; > + char undo_path[MAXPGPATH]; > + UndoLogSlot *slot = NULL; > + int i; > + > + Assert(LWLockHeldByMe(TablespaceCreateLock)); IMO this ought to be mentioned in a function header comment. > + /* First, try to kick everyone off any undo logs in this tablespace. */ > + while ((slot = UndoLogNextSlot(slot))) > + { > + bool ok; > + bool return_to_freelist = false; > + > + /* Skip undo logs in other tablespaces. */ > + if (slot->meta.tablespace != tablespace) > + continue; > + > + /* Check if this undo log can be forcibly detached. */ > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > + if (slot->meta.discard == slot->meta.unlogged.insert && > + (slot->meta.unlogged.xid == InvalidTransactionId || > + !TransactionIdIsInProgress(slot->meta.unlogged.xid))) > + { Not everyone will agree, but this looks complicated enough that I'd put it just in a simple wrapper function. 
If this were if (CanDetachUndoForcibly(slot)) you'd not need a comment either... Also, isn't the slot->meta.discard == slot->meta.unlogged.insert a separate concern from detaching? My understanding is that it'll be perfectly normal to have undo logs with undiscarded data that nobody is attached to? In fact, I got confused below, because I initially didn't spot any place that implemented the check referenced in the caller: > + * Drop the undo logs in this tablespace. This will fail (without > + * dropping anything) if there are undo logs that we can't afford to > drop > + * because they contain non-discarded data or a transaction is in > + * progress. Since we hold TablespaceCreateLock, no other session will > be > + * able to attach to an undo log in this tablespace (or any tablespace > + * except default) concurrently. > + */ > + if (!DropUndoLogsInTablespace(tablespaceoid)) > + ereport(ERROR, > + > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("tablespace \"%s\" cannot be dropped > because it contains non-empty undo logs", > + tablespacename))); > + > + /* > + else > + { > + /* > + * There is data we need in this undo log. We can't > force it to > + * be detached. > + */ > + ok = false; > + } Seems like we ought to return more information here. An error message like: > /* > + * Drop the undo logs in this tablespace. This will fail (without > + * dropping anything) if there are undo logs that we can't afford to > drop > + * because they contain non-discarded data or a transaction is in > + * progress. Since we hold TablespaceCreateLock, no other session will > be > + * able to attach to an undo log in this tablespace (or any tablespace > + * except default) concurrently. 
> + */ > + if (!DropUndoLogsInTablespace(tablespaceoid)) > + ereport(ERROR, > + > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("tablespace \"%s\" cannot be dropped > because it contains non-empty undo logs", > + tablespacename))); doesn't really allow a DBA to do anything about the issue. Seems we ought to at least include the pid in the error message? I'd perhaps just move the error message from DropTableSpace() into DropUndoLogsInTablespace(). I don't think that's worse from a layering perspective, and it allows raising a more precise error, and simplifies the API. > + /* > + * Put this undo log back on the appropriate free-list. No one > can > + * attach to it while we hold TablespaceCreateLock, but if we > return > + * earlier in a future go around this loop, we need the undo > log to > + * remain usable. We'll remove all appropriate logs from the > + * free-lists in a separate step below. > + */ > + if (return_to_freelist) > + { > + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); > + slot->next_free = > UndoLogShared->free_lists[slot->meta.persistence]; > + UndoLogShared->free_lists[slot->meta.persistence] = > slot->logno; > + LWLockRelease(UndoLogLock); > + } There are multiple places that put logs onto the freelist. I'd put them into one small function. Not primarily because it'll be easier to read, but because it makes it easier to search for places that do so. > + /* > + * We detached all backends from undo logs in this tablespace, and no > one > + * can attach to any non-default-tablespace undo logs while we hold > + * TablespaceCreateLock. We can now drop the undo logs. > + */ > + slot = NULL; > + while ((slot = UndoLogNextSlot(slot))) > + { > + /* Skip undo logs in other tablespaces. */ > + if (slot->meta.tablespace != tablespace) > + continue; > + > + /* > + * Make sure no buffers remain.
When that is done by > + * UndoLogDiscard(), the final page is left in shared_buffers > because > + * it may contain data, or at least be needed again very soon. > Here > + * we need to drop even that page from the buffer pool. > + */ > + forget_undo_buffers(slot->logno, slot->meta.discard, > slot->meta.discard, true); > + > + /* > + * TODO: For now we drop the undo log, meaning that it will > never be > + * used again. That wastes the rest of its address space. > Instead, > + * we should put it onto a special list of 'offline' undo logs, > ready > + * to be reactivated in some other tablespace. Then we can > keep the > + * unused portion of its address space. > + */ > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > + slot->meta.status = UNDO_LOG_STATUS_DISCARDED; > + LWLockRelease(&slot->mutex); > + } Before I looked up forget_undo_buffers()'s implementation I wrote: Hm. Iterating through shared buffers several times, especially when there possibly could be a good-sized number of undo logs, seems a bit superfluous. This probably isn't going to be that frequently used in practice, so it's perhaps ok. But it seems like this might be used when things are bad (i.e. there's a lot of UNDO). But I still wonder about that. Especially when there's a lot of UNDO (most of it not in shared buffers), this could end up doing a *crapton* of buffer lookups. I'm inclined to think that this case - probably in contrast to the discard case - would be better served using DropRelFileNodeBuffers(). > + /* Remove all dropped undo logs from the free-lists.
*/ > + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); > + for (i = 0; i < UndoPersistenceLevels; ++i) > + { > + UndoLogSlot *slot; > + UndoLogNumber *place; > + > + place = &UndoLogShared->free_lists[i]; > + while (*place != InvalidUndoLogNumber) > + { > + slot = find_undo_log_slot(*place, true); > + if (!slot) > + elog(ERROR, > + "corrupted undo log freelist, unknown > log %u", *place); > + if (slot->meta.status == UNDO_LOG_STATUS_DISCARDED) > + *place = slot->next_free; > + else > + place = &slot->next_free; > + } > + } > + LWLockRelease(UndoLogLock); Hm, shouldn't this check that the log is actually in the being-dropped tablespace? > +void > +ResetUndoLogs(UndoPersistence persistence) > +{ This imo ought to explain why one would want/need to do that. As far as I can tell this implementation for example wouldn't be correct in all that many situations, because it e.g. doesn't drop the relevant buffers? Seems like this would need to assert that persistence isn't PERMANENT? This is made more "problematic" by the fact that there's no caller for this in this commit, only being used much later in the series. But I think the comment should be there anyway. Hard to review (and understand) otherwise. Why is it correct not to take any locks here? The caller in 0014 afaict is when we're already in hot standby, which means people will possibly read undo? > + UndoLogSlot *slot = NULL; > + > + while ((slot = UndoLogNextSlot(slot))) > + { > + DIR *dir; > + struct dirent *de; > + char undo_path[MAXPGPATH]; > + char segment_prefix[MAXPGPATH]; > + size_t segment_prefix_size; > + > + if (slot->meta.persistence != persistence) > + continue; > + > + /* Scan the directory for files belonging to this undo log. 
*/ > + snprintf(segment_prefix, sizeof(segment_prefix), "%06X.", > slot->logno); > + segment_prefix_size = strlen(segment_prefix); > + UndoLogDirectory(slot->meta.tablespace, undo_path); > + dir = AllocateDir(undo_path); > + if (dir == NULL) > + continue; > + while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL) > + { > + char segment_path[MAXPGPATH]; > + > + if (strncmp(de->d_name, segment_prefix, > segment_prefix_size) != 0) > + continue; I'm perfectly fine with using MAXPGPATH buffers. But I do find it confusing that in some places you're using dynamic allocations (in some cases quite repeatedly, like in allocate_empty_undo_segment()), but here you don't? Hm, isn't this kinda O(#slot*#total_size_of_undo) due to going over the whole tablespace for each log? > + snprintf(segment_path, sizeof(segment_path), "%s/%s", > + undo_path, de->d_name); > + elog(DEBUG1, "unlinked undo segment \"%s\"", > segment_path); > + if (unlink(segment_path) < 0) > + elog(LOG, "couldn't unlink file \"%s\": %m", > segment_path); > + } > + FreeDir(dir); I think the LOG should be done alternatively to the DEBUG1, otherwise it's going to be confusing. Should this really only be a LOG? Going to be hard to clean up for a DBA later.
> +Datum > +pg_stat_get_undo_logs(PG_FUNCTION_ARGS) > +{ > +#define PG_STAT_GET_UNDO_LOGS_COLS 9 > + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; > + TupleDesc tupdesc; > + Tuplestorestate *tupstore; > + MemoryContext per_query_ctx; > + MemoryContext oldcontext; > + char *tablespace_name = NULL; > + Oid last_tablespace = InvalidOid; > + int i; > + > + /* check to see if caller supports us returning a tuplestore */ > + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("set-valued function called in context > that cannot accept a set"))); > + if (!(rsinfo->allowedModes & SFRM_Materialize)) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("materialize mode required, but it is > not " \ > + "allowed in this context"))); > + > + /* Build a tuple descriptor for our result type */ > + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) > + elog(ERROR, "return type must be a row type"); I wish we'd encapsulate this in one place instead of copying it over and over. Imo it's bad style to break error messages over multiple lines, makes it harder to grep for. > + /* Scan all undo logs to build the results. */ > + for (i = 0; i < UndoLogShared->nslots; ++i) > + { > + UndoLogSlot *slot = &UndoLogShared->slots[i]; > + char buffer[17]; > + Datum values[PG_STAT_GET_UNDO_LOGS_COLS]; > + bool nulls[PG_STAT_GET_UNDO_LOGS_COLS] = { false }; > + Oid tablespace; Uncommented numbers like '17' for buffer lengths make me nervous. > + values[0] = ObjectIdGetDatum((Oid) slot->logno); > + values[1] = CStringGetTextDatum( > + slot->meta.persistence == UNDO_PERMANENT ? "permanent" : > + slot->meta.persistence == UNDO_UNLOGGED ? "unlogged" : > + slot->meta.persistence == UNDO_TEMP ? 
"temporary" : > "<uknown>"); s/uknown/unknown/ > + tablespace = slot->meta.tablespace; > + > + snprintf(buffer, sizeof(buffer), UndoRecPtrFormat, > + MakeUndoRecPtr(slot->logno, > slot->meta.discard)); > + values[3] = CStringGetTextDatum(buffer); > + snprintf(buffer, sizeof(buffer), UndoRecPtrFormat, > + MakeUndoRecPtr(slot->logno, > slot->meta.unlogged.insert)); > + values[4] = CStringGetTextDatum(buffer); > + snprintf(buffer, sizeof(buffer), UndoRecPtrFormat, > + MakeUndoRecPtr(slot->logno, slot->meta.end)); > + values[5] = CStringGetTextDatum(buffer); Makes me wonder if we shouldn't have a type for undo pointers. > + if (slot->meta.unlogged.xid == InvalidTransactionId) > + nulls[6] = true; > + else > + values[6] = > TransactionIdGetDatum(slot->meta.unlogged.xid); > + if (slot->pid == InvalidPid) > + nulls[7] = true; > + else > + values[7] = Int32GetDatum((int32) slot->pid); > + switch (slot->meta.status) > + { > + case UNDO_LOG_STATUS_ACTIVE: > + values[8] = CStringGetTextDatum("ACTIVE"); break; > + case UNDO_LOG_STATUS_FULL: > + values[8] = CStringGetTextDatum("FULL"); break; > + default: > + nulls[8] = true; > + } Don't think this'll survive pgindent. > + /* > + * Deal with potentially slow tablespace name lookup without > the lock. > + * Avoid making multiple calls to that expensive function for > the > + * common case of repeating tablespace. > + */ > + if (tablespace != last_tablespace) > + { > + if (tablespace_name) > + pfree(tablespace_name); > + tablespace_name = get_tablespace_name(tablespace); > + last_tablespace = tablespace; > + } If we need to do this repeatedly, I think we ought to add a syscache for tablespace names. > + if (tablespace_name) > + { > + values[2] = CStringGetTextDatum(tablespace_name); > + nulls[2] = false; > + } > + else > + nulls[2] = true; > + > + tuplestore_putvalues(tupstore, tupdesc, values, nulls); Seems like a CHECK_FOR_INTERRUPTS() in this loop wouldn't hurt. 
> + } > + > + if (tablespace_name) > + pfree(tablespace_name); That seems a bit superfluous, given we're leaking plenty of other memory (which is perfectly fine). > +/* > + * replay the creation of a new undo log > + */ > +static void > +undolog_xlog_create(XLogReaderState *record) > +{ > + xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record); > + UndoLogSlot *slot; > + > + /* Create meta-data space in shared memory. */ > + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); > + > + /* TODO: assert that it doesn't exist already? */ > + slot = allocate_undo_log_slot(); Doesn't this need some error checking? allocate_undo_log_slot() will return NULL if there are no slots left. E.g. restarting a server with a lower max_connections could have one run into this easily? > +/* > + * Drop all buffers for the given undo log, from the old_discard to up > + * new_discard. If drop_tail is true, also drop the buffer that holds > + * new_discard; this is used when discarding undo logs completely, for > example > + * via DROP TABLESPACE. If it is false, then the final buffer is not dropped > + * because it may contain data. > + * > + */ > +static void > +forget_undo_buffers(int logno, UndoLogOffset old_discard, > + UndoLogOffset new_discard, bool > drop_tail) > +{ > + BlockNumber old_blockno; > + BlockNumber new_blockno; > + RelFileNode rnode; > + > + UndoRecPtrAssignRelFileNode(rnode, MakeUndoRecPtr(logno, old_discard)); > + old_blockno = old_discard / BLCKSZ; > + new_blockno = new_discard / BLCKSZ; > + if (drop_tail) > + ++new_blockno; > + while (old_blockno < new_blockno) > + { Hm. This'll be quite bad if you have a lot more undo than shared_buffers. Taking the partition lwlocks this many times will hurt. OTOH, scanning all of shared buffers every time we truncate a few hundred bytes of undo away is obviously also not going to work.
> + ForgetBuffer(SMGR_UNDO, rnode, UndoLogForkNum, old_blockno); > + ForgetLocalBuffer(SMGR_UNDO, rnode, UndoLogForkNum, > old_blockno++); This seems odd to me - why do we need to scan both? We ought to know which one is needed, right? > + } > +} > +/* > + * replay an undo segment discard record > + */ Missing newline between functions. > +static void > +undolog_xlog_discard(XLogReaderState *record) > +{ > + /* > + * We're about to discard undologs. In Hot Standby mode, ensure that > + * there's no queries running which need to get tuple from discarded > undo. nitpick: s/undologs/undo logs/? I think most other comments split it? > + * XXX we are passing empty rnode to the conflict function so that it > can > + * check conflict in all the backend regardless of which database the > + * backend is connected. > + */ > + if (InHotStandby && TransactionIdIsValid(xlrec->latestxid)) > + ResolveRecoveryConflictWithSnapshot(xlrec->latestxid, rnode); Hm. Perhaps it'd be better to change ResolveRecoveryConflictWithSnapshot's API to just accept the database (OTOH, we perhaps ought to be more granular in conflict processing). Or just mention that it's ok to pass in an invalid rnode? > + /* > + * See if we need to unlink or rename any files, but don't consider it > an > + * error if we find that files are missing. Since UndoLogDiscard() > + * performs filesystem operations before WAL logging or updating shmem > + * which could be checkpointed, a crash could have left files already > + * deleted, but we could replay WAL that expects the files to be there. > + */ Or we could have crashed/restarted during WAL replay and processing the same WAL again. Not sure if that's worth mentioning. > + /* Unlink or rename segments that are no longer in range. */ > + while (old_segment_begin < new_segment_begin) > + { > + char discard_path[MAXPGPATH]; > + > + /* Tell the checkpointer that the file is going away. 
*/ > + undofile_forget_sync(slot->logno, > + old_segment_begin / > UndoLogSegmentSize, > + slot->meta.tablespace); > + > + UndoLogSegmentPath(xlrec->logno, old_segment_begin / > UndoLogSegmentSize, > + slot->meta.tablespace, > discard_path); > + > + /* Can we recycle the oldest segment? */ > + if (end < xlrec->end) > + { > + char recycle_path[MAXPGPATH]; > + > + UndoLogSegmentPath(xlrec->logno, end / > UndoLogSegmentSize, > + > slot->meta.tablespace, recycle_path); > + if (rename(discard_path, recycle_path) == 0) > + { > + elog(DEBUG1, "recycled undo segment \"%s\" -> > \"%s\"", > + discard_path, recycle_path); > + end += UndoLogSegmentSize; > + } > + else > + { > + elog(LOG, "could not rename \"%s\" to \"%s\": > %m", > + discard_path, recycle_path); > + } > + } > + else > + { > + if (unlink(discard_path) == 0) > + elog(DEBUG1, "unlinked undo segment \"%s\"", > discard_path); > + else > + elog(LOG, "could not unlink \"%s\": %m", > discard_path); > + } > + old_segment_begin += UndoLogSegmentSize; > + } The code to recycle or delete one segment exists in multiple places (at least also in UndoLogDiscard()). Think it's long enough that it's easily worthwhile to share. > +/* > @@ -1418,12 +1418,18 @@ sendFile(const char *readfilename, const char > *tarfilename, struct stat *statbuf > segmentpath = strstr(filename, "."); > if (segmentpath != NULL) > { > - segmentno = atoi(segmentpath + 1); > - if (segmentno == 0) > + char *end; > + if (strstr(readfilename, "undo")) > + first_blkno = strtol(segmentpath + 1, > &end, 16) / BLCKSZ; > + else > + first_blkno = strtol(segmentpath + 1, > &end, 10) * RELSEG_SIZE; > + if (*end != '\0') > ereport(ERROR, > - (errmsg("invalid > segment number %d in file \"%s\"", > - > segmentno, filename))); > + (errmsg("invalid > segment number in file \"%s\"", > + > filename))); > } > + else > + first_blkno = 0; > } > } Hm. Not a fan of just using strstr() here. Can't quite articulate why. Just somehow rubs me wrong. 
> /* > * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require > * a relcache entry for the relation. > - * > - * NB: At present, this function may only be used on permanent relations, > which > - * is OK, because we only use it during XLOG replay. If in the future we > - * want to use it on temporary or unlogged relations, we could pass > additional > - * parameters. > */ > Buffer > ReadBufferWithoutRelcache(SmgrId smgrid, RelFileNode rnode, ForkNumber > forkNum, > BlockNumber blockNum, > ReadBufferMode mode, > - BufferAccessStrategy strategy) > + BufferAccessStrategy strategy, > + char relpersistence) > { > bool hit; > > - SMgrRelation smgr = smgropen(smgrid, rnode, InvalidBackendId); > - > - Assert(InRecovery); > + SMgrRelation smgr = smgropen(smgrid, rnode, > + relpersistence > == RELPERSISTENCE_TEMP > + ? MyBackendId > : InvalidBackendId); > > - return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, > blockNum, > + return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum, > mode, strategy, &hit); > } Hm. Using this for undo access means that we don't do any buffer read/hit counting that can be associated with the relation causing undo to be read/written. That seems like a sizable monitoring deficiency. > /* > + * ForgetBuffer -- drop a buffer from shared buffers > + * > + * If the buffer isn't present in shared buffers, nothing happens. If it is > + * present, it is discarded without making any attempt to write it back out > to > + * the operating system. The caller must therefore somehow be sure that the > + * data won't be needed for anything now or in the future. It assumes that > + * there is no concurrent access to the block, except that it might be being > + * concurrently written.
> + */ > +void > +ForgetBuffer(SmgrId smgrid, RelFileNode rnode, ForkNumber forkNum, > + BlockNumber blockNum) > +{ > + SMgrRelation smgr = smgropen(smgrid, rnode, InvalidBackendId); > + BufferTag tag; /* identity of target block */ > + uint32 hash; /* hash value for tag */ > + LWLock *partitionLock; /* buffer partition lock for it */ > + int buf_id; > + BufferDesc *bufHdr; > + uint32 buf_state; > + > + /* create a tag so we can lookup the buffer */ > + INIT_BUFFERTAG(tag, smgrid, smgr->smgr_rnode.node, forkNum, blockNum); > + > + /* determine its hash code and partition lock ID */ > + hash = BufTableHashCode(&tag); > + partitionLock = BufMappingPartitionLock(hash); > + > + /* see if the block is in the buffer pool */ > + LWLockAcquire(partitionLock, LW_SHARED); > + buf_id = BufTableLookup(&tag, hash); > + LWLockRelease(partitionLock); > + > + /* didn't find it, so nothing to do */ > + if (buf_id < 0) > + return; > + > + /* take the buffer header lock */ > + bufHdr = GetBufferDescriptor(buf_id); > + buf_state = LockBufHdr(bufHdr); > + /* > + * The buffer might been evicted after we released the partition lock > and > + * before we acquired the buffer header lock. If so, the buffer we've > + * locked might contain some other data which we shouldn't touch. If the > + * buffer hasn't been recycled, we proceed to invalidate it. > + */ > + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) && > + bufHdr->tag.blockNum == blockNum && > + bufHdr->tag.forkNum == forkNum) > + InvalidateBuffer(bufHdr); /* releases spinlock */ > + else > + UnlockBufHdr(bufHdr, buf_state); Phew, I don't like this code one bit. It imo is a bad idea / unnecessary to look up the buffer, unlock the partition lock, and then recheck identity. And do exactly the same thing again in InvalidateBuffer() (including making a copy of the tag while holding the buffer header lock). 
Seems like this should be something roughly like

	ReservePrivateRefCountEntry();

	LWLockAcquire(partitionLock, LW_SHARED);
	buf_id = BufTableLookup(&tag, hash);
	if (buf_id >= 0)
	{
		bufHdr = GetBufferDescriptor(buf_id);

		buf_state = LockBufHdr(bufHdr);

		/*
		 * Temporarily acquire pin - that prevents the buffer
		 * from being replaced with one that we did not intend
		 * to target.
		 *
		 * XXX:
		 */
		ref = PinBuffer_Locked(bufHdr, strategy);

		/* release partition lock, acquire exclusively so we can drop */
		LWLockRelease(partitionLock);

		/* loop until nobody else has the buffer pinned */
		while (true)
		{
			LWLockAcquire(partitionLock, LW_EXCLUSIVE);
			buf_state = LockBufHdr(buf);

			/*
			 * Check if somebody else is busy writing the buffer (we
			 * have one pin).
			 */
			if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
				break;

			// XXX: Should we assert IO_IN_PROGRESS? Ought to be the
			// only way to get here.

			/* wait for IO to finish, without holding locks */
			UnlockBufHdr(buf, buf_state);
			LWLockRelease(partitionLock);

			Assert(GetPrivateRefCount(BufferDescriptorGetBuffer(buf)) == 1);

			WaitIO(buf);

			/* buffer identity can't change, we've a pin */

			// XXX: Assert that the buffer isn't dirty anymore? There
			// ought to be no possibility for it to get dirty now.
		}

		Assert(!(buf_state & BM_PIN_COUNT_WAITER));

		/*
		 * Clear out the buffer's tag and flags.  We must do this to ensure
		 * that linear scans of the buffer array don't think the buffer is
		 * valid.
		 */
		oldFlags = buf_state & BUF_FLAG_MASK;
		CLEAR_BUFFERTAG(buf->tag);
		buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);

		/* remove our refcount */
		buf_state -= BUF_REFCOUNT_ONE;

		UnlockBufHdr(buf, buf_state);

		/*
		 * Remove the buffer from the lookup hashtable, if it was in there.
		 */
		if (oldFlags & BM_TAG_VALID)
			BufTableDelete(&oldTag, oldHash);

		/*
		 * Done with mapping lock.
		 */
		LWLockRelease(oldPartitionLock);

		Assert(ref->refcount == 1);
		ForgetPrivateRefCountEntry(ref);
		ResourceOwnerForgetBuffer(CurrentResourceOwner,
								  BufferDescriptorGetBuffer(buf));
	}

or something in that vein.
Now you can validly argue that this is more complicated - but I also think that this is going to be a much hotter path than normal relation drops.

<more after some errands>

Greetings,

Andres Freund