On Wed, Jan 19, 2022 at 10:37 AM Dilip Kumar <dilipbal...@gmail.com> wrote:
>
> On Thu, Jan 6, 2022 at 7:22 PM Robert Haas <robertmh...@gmail.com> wrote:
>>
>> On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro <thomas.mu...@gmail.com> wrote:
>> > Another problem is that relfilenodes are normally allocated with
>> > GetNewOidWithIndex(), and initially match a relation's OID.  We'd need
>> > a new allocator, and they won't be able to match the OID in general
>> > (while we have 32 bit OIDs at least).
>>
>> Personally I'm not sad about that. Values that are the same in simple
>> cases but diverge in more complex cases are kind of a trap for the
>> unwary. There's no real reason to have them ever match. Yeah, in
>> theory, it makes it easier to tell which file matches which relation,
>> but in practice, you always have to double-check in case the table has
>> ever been rewritten. It doesn't seem worth continuing to contort the
>> code for a property we can't guarantee anyway.
>
> Make sense, I have started working on this idea, I will try to post the
> first version by early next week.
Here is the first working patch; with this we no longer need to maintain
the tombstone file until the next checkpoint.  This is still a WIP patch,
but with it the problem with WAL-logged ALTER DATABASE SET TABLESPACE,
which Robert reported a couple of mails up in this thread, is solved.

General idea of the patch:

- Change RelFileNode.relNode to be 64 bits wide, of which 8 bits are used
  for the fork number and 56 bits for the relNode, as shown below. [1]
- GetNewRelFileNode() now just generates a new unique relfilenode, checks
  for an existing file of the same name, and throws an error if one already
  exists -- so there is no retry loop.  We also need logic to preserve
  nextRelNode across restarts and to WAL-log it, but that is similar to how
  nextOid is preserved.
- mdunlinkfork() directly unlinks the relfilenode, so all of the deferred
  unlinking code goes away.
- We no longer need any post-checkpoint unlinking activity.

[1]
/*
 * RelNodeId:
 *
 * this is a storage type for RelNode.  The reasoning behind using this is
 * the same as for BlockId, so refer to the comment atop BlockId.
 */
typedef struct RelNodeId
{
	uint32		rn_hi;
	uint32		rn_lo;
} RelNodeId;

typedef struct RelFileNode
{
	Oid			spcNode;		/* tablespace */
	Oid			dbNode;			/* database */
	RelNodeId	relNode;		/* relation */
} RelFileNode;

TODO: There are a couple of TODOs and FIXMEs that I plan to address by next
week.  I am also planning to test the case where the relfilenode consumes
more than 32 bits; for that we could set FirstNormalRelfileNode to a higher
value for testing purposes.  Comments also need improvement.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 4a6502c7950969262c6982388865bbc23e531cde Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 28 Jan 2022 18:32:35 +0530
Subject: [PATCH v1] Don't wait for next checkpoint to remove unwanted relfilenode

Currently, relfilenode is 32 bits wide, so if we remove the relfilenode
immediately after it is no longer needed then there is a risk of reusing
the same relfilenode within the same checkpoint cycle.  To avoid that, we
delay cleaning up the relfilenode until the next checkpoint.

With this patch we use 56 bits for the relfilenode.  Ideally we could make
it a full 64 bits, but that would increase the size of BufferTag, so to
keep that size the same we make RelFileNode.relNode 64 bits wide and use
8 of those bits to store the fork number and the remaining 56 bits for
the relfilenode.
---
 contrib/pg_buffercache/pg_buffercache_pages.c   |   4 +-
 contrib/pg_prewarm/autoprewarm.c                |   4 +-
 src/backend/access/common/syncscan.c            |   3 +-
 src/backend/access/gin/ginxlog.c                |   5 +-
 src/backend/access/rmgrdesc/gistdesc.c          |   4 +-
 src/backend/access/rmgrdesc/heapdesc.c          |   4 +-
 src/backend/access/rmgrdesc/nbtdesc.c           |   4 +-
 src/backend/access/rmgrdesc/seqdesc.c           |   4 +-
 src/backend/access/rmgrdesc/xlogdesc.c          |  15 +++-
 src/backend/access/transam/varsup.c             |  42 +++++++++-
 src/backend/access/transam/xlog.c               |  57 ++++++++++---
 src/backend/access/transam/xloginsert.c         |  12 +++
 src/backend/access/transam/xlogutils.c          |   9 ++-
 src/backend/catalog/catalog.c                   |  61 +++-----------
 src/backend/catalog/heap.c                      |   6 +-
 src/backend/catalog/index.c                     |   4 +-
 src/backend/catalog/storage.c                   |   3 +-
 src/backend/commands/tablecmds.c                |  18 +++--
 src/backend/replication/logical/decode.c        |   1 +
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/storage/buffer/bufmgr.c             |  61 +++++++-------
 src/backend/storage/buffer/localbuf.c           |   8 +-
 src/backend/storage/freespace/fsmpage.c         |   4 +-
 src/backend/storage/lmgr/lwlocknames.txt        |   1 +
 src/backend/storage/smgr/md.c                   |  68
+++++----------- src/backend/storage/sync/sync.c | 101 ------------------------ src/backend/utils/adt/dbsize.c | 10 +-- src/backend/utils/cache/relcache.c | 30 ++++--- src/backend/utils/cache/relmapper.c | 39 ++++----- src/backend/utils/misc/pg_controldata.c | 9 ++- src/bin/pg_controldata/pg_controldata.c | 2 + src/bin/pg_rewind/filemap.c | 16 ++-- src/bin/pg_waldump/pg_waldump.c | 14 ++-- src/common/relpath.c | 22 +++--- src/include/access/transam.h | 6 ++ src/include/access/xlog.h | 1 + src/include/catalog/catalog.h | 4 +- src/include/catalog/pg_class.h | 10 +-- src/include/catalog/pg_control.h | 2 + src/include/commands/tablecmds.h | 2 +- src/include/common/relpath.h | 6 +- src/include/storage/buf_internals.h | 12 +-- src/include/storage/relfilenode.h | 66 +++++++++++++++- src/include/storage/sync.h | 1 - src/include/utils/relmapper.h | 6 +- src/test/regress/expected/alter_table.out | 16 ++-- 46 files changed, 405 insertions(+), 374 deletions(-) diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c index 1bd579f..ddf33ac 100644 --- a/contrib/pg_buffercache/pg_buffercache_pages.c +++ b/contrib/pg_buffercache/pg_buffercache_pages.c @@ -153,10 +153,10 @@ pg_buffercache_pages(PG_FUNCTION_ARGS) buf_state = LockBufHdr(bufHdr); fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr); - fctx->record[i].relfilenode = bufHdr->tag.rnode.relNode; + fctx->record[i].relfilenode = RELFILENODE_GETRELNODE(bufHdr->tag.rnode); fctx->record[i].reltablespace = bufHdr->tag.rnode.spcNode; fctx->record[i].reldatabase = bufHdr->tag.rnode.dbNode; - fctx->record[i].forknum = bufHdr->tag.forkNum; + fctx->record[i].forknum = RELFILENODE_GETFORKNUM(bufHdr->tag.rnode); fctx->record[i].blocknum = bufHdr->tag.blockNum; fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state); fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state); diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c index 
5d40fb5..a03fd03 100644 --- a/contrib/pg_prewarm/autoprewarm.c +++ b/contrib/pg_prewarm/autoprewarm.c @@ -617,8 +617,8 @@ apw_dump_now(bool is_bgworker, bool dump_unlogged) { block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode; block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.spcNode; - block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode; - block_info_array[num_blocks].forknum = bufHdr->tag.forkNum; + block_info_array[num_blocks].filenode = RELFILENODE_GETRELNODE(bufHdr->tag.rnode); + block_info_array[num_blocks].forknum = RELFILENODE_GETFORKNUM(bufHdr->tag.rnode); block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum; ++num_blocks; } diff --git a/src/backend/access/common/syncscan.c b/src/backend/access/common/syncscan.c index d5b16c5..386de77 100644 --- a/src/backend/access/common/syncscan.c +++ b/src/backend/access/common/syncscan.c @@ -161,7 +161,8 @@ SyncScanShmemInit(void) */ item->location.relfilenode.spcNode = InvalidOid; item->location.relfilenode.dbNode = InvalidOid; - item->location.relfilenode.relNode = InvalidOid; + RELFILENODE_SETRELNODE(item->location.relfilenode, + InvalidRelfileNode); item->location.location = InvalidBlockNumber; item->prev = (i > 0) ? 
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c index 87e8366..b73a430 100644 --- a/src/backend/access/gin/ginxlog.c +++ b/src/backend/access/gin/ginxlog.c @@ -100,8 +100,9 @@ ginRedoInsertEntry(Buffer buffer, bool isLeaf, BlockNumber rightblkno, void *rda BlockNumber blknum; BufferGetTag(buffer, &node, &forknum, &blknum); - elog(ERROR, "failed to add item to index page in %u/%u/%u", - node.spcNode, node.dbNode, node.relNode); + elog(ERROR, "failed to add item to index page in %u/%u/" UINT64_FORMAT, + node.spcNode, node.dbNode, + RELFILENODE_GETRELNODE(node)); } } diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c index 9cab4fa..4ebe661 100644 --- a/src/backend/access/rmgrdesc/gistdesc.c +++ b/src/backend/access/rmgrdesc/gistdesc.c @@ -26,9 +26,9 @@ out_gistxlogPageUpdate(StringInfo buf, gistxlogPageUpdate *xlrec) static void out_gistxlogPageReuse(StringInfo buf, gistxlogPageReuse *xlrec) { - appendStringInfo(buf, "rel %u/%u/%u; blk %u; latestRemovedXid %u:%u", + appendStringInfo(buf, "rel %u/%u/" UINT64_FORMAT "; blk %u; latestRemovedXid %u:%u", xlrec->node.spcNode, xlrec->node.dbNode, - xlrec->node.relNode, xlrec->block, + RELFILENODE_GETRELNODE(xlrec->node), xlrec->block, EpochFromFullTransactionId(xlrec->latestRemovedFullXid), XidFromFullTransactionId(xlrec->latestRemovedFullXid)); } diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c index 6238085..0e024a9 100644 --- a/src/backend/access/rmgrdesc/heapdesc.c +++ b/src/backend/access/rmgrdesc/heapdesc.c @@ -169,10 +169,10 @@ heap2_desc(StringInfo buf, XLogReaderState *record) { xl_heap_new_cid *xlrec = (xl_heap_new_cid *) rec; - appendStringInfo(buf, "rel %u/%u/%u; tid %u/%u", + appendStringInfo(buf, "rel %u/%u/" UINT64_FORMAT "; tid %u/%u", xlrec->target_node.spcNode, xlrec->target_node.dbNode, - xlrec->target_node.relNode, + RELFILENODE_GETRELNODE(xlrec->target_node), 
ItemPointerGetBlockNumber(&(xlrec->target_tid)), ItemPointerGetOffsetNumber(&(xlrec->target_tid))); appendStringInfo(buf, "; cmin: %u, cmax: %u, combo: %u", diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c index dfbbf4e..78c5eb4 100644 --- a/src/backend/access/rmgrdesc/nbtdesc.c +++ b/src/backend/access/rmgrdesc/nbtdesc.c @@ -100,9 +100,9 @@ btree_desc(StringInfo buf, XLogReaderState *record) { xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) rec; - appendStringInfo(buf, "rel %u/%u/%u; latestRemovedXid %u:%u", + appendStringInfo(buf, "rel %u/%u/" UINT64_FORMAT "; latestRemovedXid %u:%u", xlrec->node.spcNode, xlrec->node.dbNode, - xlrec->node.relNode, + RELFILENODE_GETRELNODE(xlrec->node), EpochFromFullTransactionId(xlrec->latestRemovedFullXid), XidFromFullTransactionId(xlrec->latestRemovedFullXid)); break; diff --git a/src/backend/access/rmgrdesc/seqdesc.c b/src/backend/access/rmgrdesc/seqdesc.c index d9b1e60..56a9e26 100644 --- a/src/backend/access/rmgrdesc/seqdesc.c +++ b/src/backend/access/rmgrdesc/seqdesc.c @@ -25,9 +25,9 @@ seq_desc(StringInfo buf, XLogReaderState *record) xl_seq_rec *xlrec = (xl_seq_rec *) rec; if (info == XLOG_SEQ_LOG) - appendStringInfo(buf, "rel %u/%u/%u", + appendStringInfo(buf, "rel %u/%u/" UINT64_FORMAT, xlrec->node.spcNode, xlrec->node.dbNode, - xlrec->node.relNode); + RELFILENODE_GETRELNODE(xlrec->node)); } const char * diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c index e7452af..1c5b561 100644 --- a/src/backend/access/rmgrdesc/xlogdesc.c +++ b/src/backend/access/rmgrdesc/xlogdesc.c @@ -45,8 +45,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record) CheckPoint *checkpoint = (CheckPoint *) rec; appendStringInfo(buf, "redo %X/%X; " - "tli %u; prev tli %u; fpw %s; xid %u:%u; oid %u; multi %u; offset %u; " - "oldest xid %u in DB %u; oldest multi %u in DB %u; " + "tli %u; prev tli %u; fpw %s; xid %u:%u; relfilenode " UINT64_FORMAT ";oid %u; " + 
"multi %u; offset %u; oldest xid %u in DB %u; oldest multi %u in DB %u; " "oldest/newest commit timestamp xid: %u/%u; " "oldest running xid %u; %s", LSN_FORMAT_ARGS(checkpoint->redo), @@ -55,6 +55,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record) checkpoint->fullPageWrites ? "true" : "false", EpochFromFullTransactionId(checkpoint->nextXid), XidFromFullTransactionId(checkpoint->nextXid), + checkpoint->nextRelNode, checkpoint->nextOid, checkpoint->nextMulti, checkpoint->nextMultiOffset, @@ -74,6 +75,13 @@ xlog_desc(StringInfo buf, XLogReaderState *record) memcpy(&nextOid, rec, sizeof(Oid)); appendStringInfo(buf, "%u", nextOid); } + else if (info == XLOG_NEXT_RELFILENODE) + { + RelNode nextRelFilenode; + + memcpy(&nextRelFilenode, rec, sizeof(RelNode)); + appendStringInfo(buf, UINT64_FORMAT, nextRelFilenode); + } else if (info == XLOG_RESTORE_POINT) { xl_restore_point *xlrec = (xl_restore_point *) rec; @@ -169,6 +177,9 @@ xlog_identify(uint8 info) case XLOG_NEXTOID: id = "NEXTOID"; break; + case XLOG_NEXT_RELFILENODE: + id = "NEXT_RELFILENODE"; + break; case XLOG_SWITCH: id = "SWITCH"; break; diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c index 748120a..22396a5 100644 --- a/src/backend/access/transam/varsup.c +++ b/src/backend/access/transam/varsup.c @@ -30,6 +30,9 @@ /* Number of OIDs to prefetch (preallocate) per XLOG write */ #define VAR_OID_PREFETCH 8192 +/* Number of RelFileNode to prefetch (preallocate) per XLOG write */ +#define VAR_RFN_PREFETCH 8192 + /* pointer to "variable cache" in shared memory (set up by shmem.c) */ VariableCache ShmemVariableCache = NULL; @@ -521,8 +524,7 @@ ForceTransactionIdLimitUpdate(void) * wide, counter wraparound will occur eventually, and therefore it is unwise * to assume they are unique unless precautions are taken to make them so. * Hence, this routine should generally not be used directly. 
The only direct
- * callers should be GetNewOidWithIndex() and GetNewRelFileNode() in
- * catalog/catalog.c.
+ * callers should be GetNewOidWithIndex() in catalog/catalog.c.
  */
 Oid
 GetNewObjectId(void)
@@ -613,6 +615,42 @@ SetNextObjectId(Oid nextOid)
 }
 
 /*
+ * GetNewRelNode
+ *
+ * Similar to GetNewObjectId, but it generates a new relfilenode instead of
+ * a new Oid.  The relfilenode is 56 bits wide, so we don't need to worry
+ * about the wraparound case.
+ */
+RelNode
+GetNewRelNode(void)
+{
+	RelNode		result;
+
+	/* safety check, we should never get this far in a HS standby */
+	if (RecoveryInProgress())
+		elog(ERROR, "cannot assign RelFileNode during recovery");
+
+	LWLockAcquire(RelNodeGenLock, LW_EXCLUSIVE);
+
+	/* If we have run out of WAL-logged RelNodes then we must log more */
+	if (ShmemVariableCache->relnodecount == 0)
+	{
+		XLogPutNextRelFileNode(ShmemVariableCache->nextRelNode +
+							   VAR_RFN_PREFETCH);
+
+		ShmemVariableCache->relnodecount = VAR_RFN_PREFETCH;
+	}
+
+	result = ShmemVariableCache->nextRelNode;
+	(ShmemVariableCache->nextRelNode)++;
+	(ShmemVariableCache->relnodecount)--;
+
+	LWLockRelease(RelNodeGenLock);
+
+	return result;
+}
+
+/*
  * StopGeneratingPinnedObjectIds
  *
  * This is called once during initdb to force the OID counter up to
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dfe2a0b..be633dc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1541,8 +1541,9 @@ checkXLogConsistency(XLogReaderState *record)
 		if (memcmp(replay_image_masked, primary_image_masked, BLCKSZ) != 0)
 		{
 			elog(FATAL,
-				 "inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",
-				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 "inconsistent page found, rel %u/%u/" UINT64_FORMAT ", forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode,
+				 RELFILENODE_GETRELNODE(rnode),
 				 forknum, blkno);
 		}
 	}
@@ -5396,6 +5397,7 @@ BootStrapXLOG(void)
 	checkPoint.nextXid = FullTransactionIdFromEpochAndXid(0,
FirstNormalTransactionId); checkPoint.nextOid = FirstGenbkiObjectId; + checkPoint.nextRelNode = FirstNormalRelfileNode; checkPoint.nextMulti = FirstMultiXactId; checkPoint.nextMultiOffset = 0; checkPoint.oldestXid = FirstNormalTransactionId; @@ -5409,7 +5411,9 @@ BootStrapXLOG(void) ShmemVariableCache->nextXid = checkPoint.nextXid; ShmemVariableCache->nextOid = checkPoint.nextOid; + ShmemVariableCache->nextRelNode = checkPoint.nextRelNode; ShmemVariableCache->oidCount = 0; + ShmemVariableCache->relnodecount = 0; MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset); AdvanceOldestClogXid(checkPoint.oldestXid); SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB); @@ -7147,7 +7151,9 @@ StartupXLOG(void) /* initialize shared memory variables from the checkpoint record */ ShmemVariableCache->nextXid = checkPoint.nextXid; ShmemVariableCache->nextOid = checkPoint.nextOid; + ShmemVariableCache->nextRelNode = checkPoint.nextRelNode; ShmemVariableCache->oidCount = 0; + ShmemVariableCache->relnodecount = 0; MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset); AdvanceOldestClogXid(checkPoint.oldestXid); SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB); @@ -9259,6 +9265,12 @@ CreateCheckPoint(int flags) checkPoint.nextOid += ShmemVariableCache->oidCount; LWLockRelease(OidGenLock); + LWLockAcquire(RelNodeGenLock, LW_SHARED); + checkPoint.nextRelNode = ShmemVariableCache->nextRelNode; + if (!shutdown) + checkPoint.nextRelNode += ShmemVariableCache->relnodecount; + LWLockRelease(RelNodeGenLock); + MultiXactGetCheckptMulti(shutdown, &checkPoint.nextMulti, &checkPoint.nextMultiOffset, @@ -9405,11 +9417,6 @@ CreateCheckPoint(int flags) END_CRIT_SECTION(); /* - * Let smgr do post-checkpoint cleanup (eg, deleting old files). - */ - SyncPostCheckpoint(); - - /* * Update the average distance between checkpoints if the prior checkpoint * exists. 
	 */
@@ -10070,6 +10077,18 @@ XLogPutNextOid(Oid nextOid)
 }
 
 /*
+ * Similar to XLogPutNextOid, but instead of writing a NEXTOID log record it
+ * writes an XLOG_NEXT_RELFILENODE log record.
+ */
+void
+XLogPutNextRelFileNode(RelNode nextrelnode)
+{
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&nextrelnode), sizeof(RelNode));
+	(void) XLogInsert(RM_XLOG_ID, XLOG_NEXT_RELFILENODE);
+}
+
+/*
  * Write an XLOG SWITCH record.
  *
  * Here we just blindly issue an XLogInsert request for the record.
@@ -10331,6 +10350,16 @@ xlog_redo(XLogReaderState *record)
 		ShmemVariableCache->oidCount = 0;
 		LWLockRelease(OidGenLock);
 	}
+	if (info == XLOG_NEXT_RELFILENODE)
+	{
+		RelNode		nextRelNode;
+
+		memcpy(&nextRelNode, XLogRecGetData(record), sizeof(RelNode));
+		LWLockAcquire(RelNodeGenLock, LW_EXCLUSIVE);
+		ShmemVariableCache->nextRelNode = nextRelNode;
+		ShmemVariableCache->relnodecount = 0;
+		LWLockRelease(RelNodeGenLock);
+	}
 	else if (info == XLOG_CHECKPOINT_SHUTDOWN)
 	{
 		CheckPoint	checkPoint;
@@ -10344,6 +10373,10 @@ xlog_redo(XLogReaderState *record)
 		ShmemVariableCache->nextOid = checkPoint.nextOid;
 		ShmemVariableCache->oidCount = 0;
 		LWLockRelease(OidGenLock);
+		LWLockAcquire(RelNodeGenLock, LW_EXCLUSIVE);
+		ShmemVariableCache->nextRelNode = checkPoint.nextRelNode;
+		ShmemVariableCache->relnodecount = 0;
+		LWLockRelease(RelNodeGenLock);
 		MultiXactSetNextMXact(checkPoint.nextMulti,
 							  checkPoint.nextMultiOffset);
@@ -10713,15 +10746,17 @@ xlog_block_info(StringInfo buf, XLogReaderState *record)
 		XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blk);
 		if (forknum != MAIN_FORKNUM)
-			appendStringInfo(buf, "; blkref #%d: rel %u/%u/%u, fork %u, blk %u",
+			appendStringInfo(buf, "; blkref #%d: rel %u/%u/" UINT64_FORMAT ", fork %u, blk %u",
 							 block_id,
-							 rnode.spcNode, rnode.dbNode, rnode.relNode,
+							 rnode.spcNode, rnode.dbNode,
+							 RELFILENODE_GETRELNODE(rnode),
 							 forknum, blk);
 		else
-			appendStringInfo(buf, "; blkref #%d: rel %u/%u/"
UINT64_FORMAT ", blk %u", block_id, - rnode.spcNode, rnode.dbNode, rnode.relNode, + rnode.spcNode, rnode.dbNode, + RELFILENODE_GETRELNODE(rnode), blk); if (XLogRecHasBlockImage(record, block_id)) appendStringInfoString(buf, " FPW"); diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c index c260310..dc5e101 100644 --- a/src/backend/access/transam/xloginsert.c +++ b/src/backend/access/transam/xloginsert.c @@ -244,6 +244,18 @@ XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags) regbuf = ®istered_buffers[block_id]; BufferGetTag(buffer, ®buf->rnode, ®buf->forkno, ®buf->block); + + /* + * In the registered buffer we are writing the fork number separately so + * clear it from the rnode. The reason we need to clear this is because if + * we are registering multiple blocks which have the same RelFileNode then + * we will not write the RelFileNode multiple times. So the problem is + * that if those blocks are for different fork numbers then if we keep the + * fork number as part of the RelFileNode.relNode then we can not reuse the + * same RelFileNode. + */ + RELFILENODE_CLEARFORKNUM(regbuf->rnode); + regbuf->page = BufferGetPage(buffer); regbuf->flags = flags; regbuf->rdata_tail = (XLogRecData *) ®buf->rdata_head; diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c index 90e1c483..d09ead1 100644 --- a/src/backend/access/transam/xlogutils.c +++ b/src/backend/access/transam/xlogutils.c @@ -593,17 +593,18 @@ CreateFakeRelcacheEntry(RelFileNode rnode) rel->rd_rel->relpersistence = RELPERSISTENCE_PERMANENT; /* We don't know the name of the relation; use relfilenode instead */ - sprintf(RelationGetRelationName(rel), "%u", rnode.relNode); + sprintf(RelationGetRelationName(rel), UINT64_FORMAT, + RELFILENODE_GETRELNODE(rnode)); /* * We set up the lockRelId in case anything tries to lock the dummy - * relation. 
Note that this is fairly bogus since relNode may be
- * different from the relation's OID.  It shouldn't really matter though.
+ * relation.  Note we are setting relId to just FirstNormalObjectId which
+ * is completely bogus.  It shouldn't really matter though.
  * In recovery, we are running by ourselves and can't have any lock
  * conflicts.  While syncing, we already hold AccessExclusiveLock.
  */
 	rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode;
-	rel->rd_lockInfo.lockRelId.relId = rnode.relNode;
+	rel->rd_lockInfo.lockRelId.relId = FirstNormalObjectId;
 
 	rel->rd_smgr = NULL;
 
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index dfd5fb6..5afbd07 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -472,27 +472,18 @@ GetNewOidWithIndex(Relation relation, Oid indexId, AttrNumber oidcolumn)
 
 /*
  * GetNewRelFileNode
- *		Generate a new relfilenode number that is unique within the
- *		database of the given tablespace.
+ *		Generate a new relfilenode number.
  *
- * If the relfilenode will also be used as the relation's OID, pass the
- * opened pg_class catalog, and this routine will guarantee that the result
- * is also an unused OID within pg_class.  If the result is to be used only
- * as a relfilenode for an existing relation, pass NULL for pg_class.
- *
- * As with GetNewOidWithIndex(), there is some theoretical risk of a race
- * condition, but it doesn't seem worth worrying about.
- *
- * Note: we don't support using this in bootstrap mode.  All relations
- * created by bootstrap have preassigned OIDs, so there's no need.
+ * We are using 56 bits for the relfilenode, so we expect it to be unique
+ * across the cluster; if the file already exists, report an error.
*/ -Oid -GetNewRelFileNode(Oid reltablespace, Relation pg_class, char relpersistence) +RelNode +GetNewRelFileNode(Oid reltablespace, char relpersistence) { RelFileNodeBackend rnode; char *rpath; - bool collides; BackendId backend; + RelNode relNode; /* * If we ever get here during pg_upgrade, there's something wrong; all @@ -525,42 +516,16 @@ GetNewRelFileNode(Oid reltablespace, Relation pg_class, char relpersistence) * are properly detected. */ rnode.backend = backend; + relNode = GetNewRelNode(); + RELFILENODE_SETRELNODE(rnode.node, relNode); - do - { - CHECK_FOR_INTERRUPTS(); - - /* Generate the OID */ - if (pg_class) - rnode.node.relNode = GetNewOidWithIndex(pg_class, ClassOidIndexId, - Anum_pg_class_oid); - else - rnode.node.relNode = GetNewObjectId(); - - /* Check for existing file of same name */ - rpath = relpath(rnode, MAIN_FORKNUM); + /* Check for existing file of same name */ + rpath = relpath(rnode, MAIN_FORKNUM); - if (access(rpath, F_OK) == 0) - { - /* definite collision */ - collides = true; - } - else - { - /* - * Here we have a little bit of a dilemma: if errno is something - * other than ENOENT, should we declare a collision and loop? In - * practice it seems best to go ahead regardless of the errno. If - * there is a colliding file we will get an smgr failure when we - * attempt to create the new relation file. - */ - collides = false; - } - - pfree(rpath); - } while (collides); + if (access(rpath, F_OK) == 0) + elog(ERROR, "new relfilenode file already exists: \"%s\"\n", rpath); - return rnode.node.relNode; + return relNode; } /* diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index 7e99de8..d00575a 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -359,7 +359,7 @@ heap_create(const char *relname, * with oid same as relid. 
*/ if (!OidIsValid(relfilenode)) - relfilenode = relid; + relfilenode = GetNewRelFileNode(reltablespace, relpersistence); } /* @@ -1243,8 +1243,8 @@ heap_create_with_catalog(const char *relname, } if (!OidIsValid(relid)) - relid = GetNewRelFileNode(reltablespace, pg_class_desc, - relpersistence); + relid = GetNewOidWithIndex(pg_class_desc, ClassOidIndexId, + Anum_pg_class_oid); } /* diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index 2308d40..76e3702 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -935,8 +935,8 @@ index_create(Relation heapRelation, } else { - indexRelationId = - GetNewRelFileNode(tableSpaceId, pg_class, relpersistence); + indexRelationId = GetNewOidWithIndex(pg_class, ClassOidIndexId, + Anum_pg_class_oid); } } diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index 9b80755..712e995 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -593,7 +593,8 @@ RestorePendingSyncs(char *startAddress) RelFileNode *rnode; Assert(pendingSyncHash == NULL); - for (rnode = (RelFileNode *) startAddress; rnode->relNode != 0; rnode++) + for (rnode = (RelFileNode *) startAddress; + RELFILENODE_GETRELNODE(*rnode) != 0; rnode++) AddPendingSync(rnode); } diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 1f0654c..7a048b3 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -3304,7 +3304,7 @@ CheckRelationTableSpaceMove(Relation rel, Oid newTableSpaceId) void SetRelationTableSpace(Relation rel, Oid newTableSpaceId, - Oid newRelFileNode) + RelNode newRelFileNode) { Relation pg_class; HeapTuple tuple; @@ -3324,7 +3324,7 @@ SetRelationTableSpace(Relation rel, /* Update the pg_class row. */ rd_rel->reltablespace = (newTableSpaceId == MyDatabaseTableSpace) ? 
InvalidOid : newTableSpaceId;
-	if (OidIsValid(newRelFileNode))
+	if (newRelFileNode != InvalidRelfileNode)
 		rd_rel->relfilenode = newRelFileNode;
 
 	CatalogTupleUpdate(pg_class, &tuple->t_self, tuple);
 
@@ -13441,7 +13441,7 @@ TryReuseIndex(Oid oldId, IndexStmt *stmt)
 		/* If it's a partitioned index, there is no storage to share. */
 		if (irel->rd_rel->relkind != RELKIND_PARTITIONED_INDEX)
 		{
-			stmt->oldNode = irel->rd_node.relNode;
+			stmt->oldNode = RELFILENODE_GETRELNODE(irel->rd_node);
 			stmt->oldCreateSubid = irel->rd_createSubid;
 			stmt->oldFirstRelfilenodeSubid = irel->rd_firstRelfilenodeSubid;
 		}
@@ -14290,7 +14290,7 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 {
 	Relation	rel;
 	Oid			reltoastrelid;
-	Oid			newrelfilenode;
+	RelNode		newrelfilenode;
 	RelFileNode newrnode;
 	List	   *reltoastidxids = NIL;
 	ListCell   *lc;
@@ -14320,15 +14320,17 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 	}
 
 	/*
-	 * Relfilenodes are not unique in databases across tablespaces, so we need
-	 * to allocate a new one in the new tablespace.
+	 * Generate a new relfilenode for the table in the new tablespace.
+	 *
+	 * XXX Relfilenodes are unique across the cluster, so can we reuse the
+	 * same relfilenode in the new tablespace?
*/ - newrelfilenode = GetNewRelFileNode(newTableSpace, NULL, + newrelfilenode = GetNewRelFileNode(newTableSpace, rel->rd_rel->relpersistence); /* Open old and new relation */ newrnode = rel->rd_node; - newrnode.relNode = newrelfilenode; + RELFILENODE_SETRELNODE(newrnode, newrelfilenode); newrnode.spcNode = newTableSpace; /* hand off to AM to actually create the new filenode and copy the data */ diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c index 3fb5a92..23822c1 100644 --- a/src/backend/replication/logical/decode.c +++ b/src/backend/replication/logical/decode.c @@ -154,6 +154,7 @@ xlog_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) break; case XLOG_NOOP: case XLOG_NEXTOID: + case XLOG_NEXT_RELFILENODE: case XLOG_SWITCH: case XLOG_BACKUP_END: case XLOG_PARAMETER_CHANGE: diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index 19b2ba2..143d403 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -2134,7 +2134,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, Assert(snapshot_now); reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode, - change->data.tp.relnode.relNode); + RELFILENODE_GETRELNODE(change->data.tp.relnode)); /* * Mapped catalog tuple without data, emitted while diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index a2512e7..88d276c 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -818,7 +818,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum, smgr->smgr_rnode.node.spcNode, smgr->smgr_rnode.node.dbNode, - smgr->smgr_rnode.node.relNode, + RELFILENODE_GETRELNODE(smgr->smgr_rnode.node), smgr->smgr_rnode.backend, isExtend); @@ -880,7 +880,7 @@ ReadBuffer_common(SMgrRelation smgr, char 
relpersistence, ForkNumber forkNum, TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum, smgr->smgr_rnode.node.spcNode, smgr->smgr_rnode.node.dbNode, - smgr->smgr_rnode.node.relNode, + RELFILENODE_GETRELNODE(smgr->smgr_rnode.node), smgr->smgr_rnode.backend, isExtend, found); @@ -1070,7 +1070,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum, smgr->smgr_rnode.node.spcNode, smgr->smgr_rnode.node.dbNode, - smgr->smgr_rnode.node.relNode, + RELFILENODE_GETRELNODE(smgr->smgr_rnode.node), smgr->smgr_rnode.backend, isExtend, found); @@ -1249,7 +1249,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum, smgr->smgr_rnode.node.spcNode, smgr->smgr_rnode.node.dbNode, - smgr->smgr_rnode.node.relNode); + RELFILENODE_GETRELNODE(smgr->smgr_rnode.node)); FlushBuffer(buf, NULL); LWLockRelease(BufferDescriptorGetContentLock(buf)); @@ -1260,7 +1260,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum, smgr->smgr_rnode.node.spcNode, smgr->smgr_rnode.node.dbNode, - smgr->smgr_rnode.node.relNode); + RELFILENODE_GETRELNODE(smgr->smgr_rnode.node)); } else { @@ -1640,7 +1640,7 @@ ReleaseAndReadBuffer(Buffer buffer, bufHdr = GetLocalBufferDescriptor(-buffer - 1); if (bufHdr->tag.blockNum == blockNum && RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) && - bufHdr->tag.forkNum == forkNum) + RELFILENODE_GETFORKNUM(bufHdr->tag.rnode) == forkNum) return buffer; ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer); LocalRefCount[-buffer - 1]--; @@ -1651,7 +1651,7 @@ ReleaseAndReadBuffer(Buffer buffer, /* we have pin, so it's ok to examine tag without spinlock */ if (bufHdr->tag.blockNum == blockNum && RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node) && - bufHdr->tag.forkNum == forkNum) + RELFILENODE_GETFORKNUM(bufHdr->tag.rnode) == 
forkNum) return buffer; UnpinBuffer(bufHdr, true); } @@ -1993,8 +1993,8 @@ BufferSync(int flags) item = &CkptBufferIds[num_to_scan++]; item->buf_id = buf_id; item->tsId = bufHdr->tag.rnode.spcNode; - item->relNode = bufHdr->tag.rnode.relNode; - item->forkNum = bufHdr->tag.forkNum; + item->relNode = RELFILENODE_GETRELNODE(bufHdr->tag.rnode); + item->forkNum = RELFILENODE_GETFORKNUM(bufHdr->tag.rnode); item->blockNum = bufHdr->tag.blockNum; } @@ -2701,7 +2701,8 @@ PrintBufferLeakWarning(Buffer buffer) } /* theoretically we should lock the bufhdr here */ - path = relpathbackend(buf->tag.rnode, backend, buf->tag.forkNum); + path = relpathbackend(buf->tag.rnode, backend, + RELFILENODE_GETFORKNUM(buf->tag.rnode)); buf_state = pg_atomic_read_u32(&buf->state); elog(WARNING, "buffer refcount leak: [%03d] " @@ -2781,7 +2782,7 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum, /* pinned, so OK to read tag without spinlock */ *rnode = bufHdr->tag.rnode; - *forknum = bufHdr->tag.forkNum; + *forknum = RELFILENODE_GETFORKNUM(bufHdr->tag.rnode); *blknum = bufHdr->tag.blockNum; } @@ -2833,11 +2834,11 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln) if (reln == NULL) reln = smgropen(buf->tag.rnode, InvalidBackendId); - TRACE_POSTGRESQL_BUFFER_FLUSH_START(buf->tag.forkNum, + TRACE_POSTGRESQL_BUFFER_FLUSH_START(RELFILENODE_GETFORKNUM(buf->tag.rnode), buf->tag.blockNum, reln->smgr_rnode.node.spcNode, reln->smgr_rnode.node.dbNode, - reln->smgr_rnode.node.relNode); + RELFILENODE_GETRELNODE(reln->smgr_rnode.node)); buf_state = LockBufHdr(buf); @@ -2892,7 +2893,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln) * bufToWrite is either the shared buffer or a copy, as appropriate. 
*/ smgrwrite(reln, - buf->tag.forkNum, + RELFILENODE_GETFORKNUM(buf->tag.rnode), buf->tag.blockNum, bufToWrite, false); @@ -2913,11 +2914,11 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln) */ TerminateBufferIO(buf, true, 0); - TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(buf->tag.forkNum, + TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(RELFILENODE_GETFORKNUM(buf->tag.rnode), buf->tag.blockNum, reln->smgr_rnode.node.spcNode, reln->smgr_rnode.node.dbNode, - reln->smgr_rnode.node.relNode); + RELFILENODE_GETRELNODE(reln->smgr_rnode.node)); /* Pop the error context stack */ error_context_stack = errcallback.previous; @@ -3142,7 +3143,7 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, for (j = 0; j < nforks; j++) { if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) && - bufHdr->tag.forkNum == forkNum[j] && + RELFILENODE_GETFORKNUM(bufHdr->tag.rnode) == forkNum[j] && bufHdr->tag.blockNum >= firstDelBlock[j]) { InvalidateBuffer(bufHdr); /* releases spinlock */ @@ -3374,7 +3375,7 @@ FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber forkNum, buf_state = LockBufHdr(bufHdr); if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) && - bufHdr->tag.forkNum == forkNum && + RELFILENODE_GETFORKNUM(bufHdr->tag.rnode) == forkNum && bufHdr->tag.blockNum >= firstDelBlock) InvalidateBuffer(bufHdr); /* releases spinlock */ else @@ -3528,7 +3529,7 @@ FlushRelationBuffers(Relation rel) PageSetChecksumInplace(localpage, bufHdr->tag.blockNum); smgrwrite(RelationGetSmgr(rel), - bufHdr->tag.forkNum, + RELFILENODE_GETFORKNUM(bufHdr->tag.rnode), bufHdr->tag.blockNum, localpage, false); @@ -4491,7 +4492,8 @@ AbortBufferIO(void) /* Buffer is pinned, so we can read tag without spinlock */ char *path; - path = relpathperm(buf->tag.rnode, buf->tag.forkNum); + path = relpathperm(buf->tag.rnode, + RELFILENODE_GETFORKNUM(buf->tag.rnode)); ereport(WARNING, (errcode(ERRCODE_IO_ERROR), errmsg("could not write block %u of %s", @@ -4515,7 +4517,8 @@ shared_buffer_write_error_callback(void *arg) /* 
Buffer is pinned, so we can read the tag without locking the spinlock */ if (bufHdr != NULL) { - char *path = relpathperm(bufHdr->tag.rnode, bufHdr->tag.forkNum); + char *path = relpathperm(bufHdr->tag.rnode, + RELFILENODE_GETFORKNUM(bufHdr->tag.rnode)); errcontext("writing block %u of relation %s", bufHdr->tag.blockNum, path); @@ -4534,7 +4537,7 @@ local_buffer_write_error_callback(void *arg) if (bufHdr != NULL) { char *path = relpathbackend(bufHdr->tag.rnode, MyBackendId, - bufHdr->tag.forkNum); + RELFILENODE_GETFORKNUM(bufHdr->tag.rnode)); errcontext("writing block %u of relation %s", bufHdr->tag.blockNum, path); @@ -4551,9 +4554,9 @@ rnode_comparator(const void *p1, const void *p2) RelFileNode n1 = *(const RelFileNode *) p1; RelFileNode n2 = *(const RelFileNode *) p2; - if (n1.relNode < n2.relNode) + if (RELFILENODE_GETRELNODE(n1) < RELFILENODE_GETRELNODE(n2)) return -1; - else if (n1.relNode > n2.relNode) + else if (RELFILENODE_GETRELNODE(n1) > RELFILENODE_GETRELNODE(n2)) return 1; if (n1.dbNode < n2.dbNode) @@ -4634,9 +4637,9 @@ buffertag_comparator(const BufferTag *ba, const BufferTag *bb) if (ret != 0) return ret; - if (ba->forkNum < bb->forkNum) + if (RELFILENODE_GETFORKNUM(ba->rnode) < RELFILENODE_GETFORKNUM(bb->rnode)) return -1; - if (ba->forkNum > bb->forkNum) + if (RELFILENODE_GETFORKNUM(ba->rnode) > RELFILENODE_GETFORKNUM(bb->rnode)) return 1; if (ba->blockNum < bb->blockNum) @@ -4801,7 +4804,8 @@ IssuePendingWritebacks(WritebackContext *context) /* different file, stop */ if (!RelFileNodeEquals(cur->tag.rnode, next->tag.rnode) || - cur->tag.forkNum != next->tag.forkNum) + RELFILENODE_GETFORKNUM(cur->tag.rnode) != + RELFILENODE_GETFORKNUM(next->tag.rnode)) break; /* ok, block queued twice, skip */ @@ -4820,7 +4824,8 @@ IssuePendingWritebacks(WritebackContext *context) /* and finally tell the kernel to write the data to storage */ reln = smgropen(tag.rnode, InvalidBackendId); -
smgrwriteback(reln, RELFILENODE_GETFORKNUM(tag.rnode), tag.blockNum, + nblocks); } context->nr_pending = 0; diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index e71f95a..2892733 100644 --- a/src/backend/storage/buffer/localbuf.c +++ b/src/backend/storage/buffer/localbuf.c @@ -221,7 +221,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, /* And write... */ smgrwrite(oreln, - bufHdr->tag.forkNum, + RELFILENODE_GETFORKNUM(bufHdr->tag.rnode), bufHdr->tag.blockNum, localpage, false); @@ -338,14 +338,14 @@ DropRelFileNodeLocalBuffers(RelFileNode rnode, ForkNumber forkNum, if ((buf_state & BM_TAG_VALID) && RelFileNodeEquals(bufHdr->tag.rnode, rnode) && - bufHdr->tag.forkNum == forkNum && + RELFILENODE_GETFORKNUM(bufHdr->tag.rnode) == forkNum && bufHdr->tag.blockNum >= firstDelBlock) { if (LocalRefCount[i] != 0) elog(ERROR, "block %u of %s is still referenced (local %u)", bufHdr->tag.blockNum, relpathbackend(bufHdr->tag.rnode, MyBackendId, - bufHdr->tag.forkNum), + RELFILENODE_GETFORKNUM(bufHdr->tag.rnode)), LocalRefCount[i]); /* Remove entry from hashtable */ hresult = (LocalBufferLookupEnt *) @@ -389,7 +389,7 @@ DropRelFileNodeAllLocalBuffers(RelFileNode rnode) elog(ERROR, "block %u of %s is still referenced (local %u)", bufHdr->tag.blockNum, relpathbackend(bufHdr->tag.rnode, MyBackendId, - bufHdr->tag.forkNum), + RELFILENODE_GETFORKNUM(bufHdr->tag.rnode)), LocalRefCount[i]); /* Remove entry from hashtable */ hresult = (LocalBufferLookupEnt *) diff --git a/src/backend/storage/freespace/fsmpage.c b/src/backend/storage/freespace/fsmpage.c index d165b35..cbb667f 100644 --- a/src/backend/storage/freespace/fsmpage.c +++ b/src/backend/storage/freespace/fsmpage.c @@ -273,8 +273,8 @@ restart: BlockNumber blknum; BufferGetTag(buf, &rnode, &forknum, &blknum); - elog(DEBUG1, "fixing corrupt FSM block %u, relation %u/%u/%u", - blknum, rnode.spcNode, rnode.dbNode, rnode.relNode); + elog(DEBUG1, "fixing 
corrupt FSM block %u, relation %u/%u/" UINT64_FORMAT, + blknum, rnode.spcNode, rnode.dbNode, RELFILENODE_GETRELNODE(rnode)); /* make sure we hold an exclusive lock */ if (!exclusive_lock_held) diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt index 6c7cf6c..1eb6d78 100644 --- a/src/backend/storage/lmgr/lwlocknames.txt +++ b/src/backend/storage/lmgr/lwlocknames.txt @@ -53,3 +53,4 @@ XactTruncationLock 44 # 45 was XactTruncationLock until removal of BackendRandomLock WrapLimitsVacuumLock 46 NotifyQueueTailLock 47 +RelNodeGenLock 48 \ No newline at end of file diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index d26c915..8e2c60f 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -124,8 +124,6 @@ static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, static MdfdVec *mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior); static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg); -static void register_unlink_segment(RelFileNodeBackend rnode, ForkNumber forknum, - BlockNumber segno); static void register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum, BlockNumber segno); static void _fdvec_resize(SMgrRelation reln, @@ -321,36 +319,25 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo) /* * Delete or truncate the first segment. 
*/ - if (isRedo || forkNum != MAIN_FORKNUM || RelFileNodeBackendIsTemp(rnode)) + if (!RelFileNodeBackendIsTemp(rnode)) { - if (!RelFileNodeBackendIsTemp(rnode)) - { - /* Prevent other backends' fds from holding on to the disk space */ - ret = do_truncate(path); - - /* Forget any pending sync requests for the first segment */ - register_forget_request(rnode, forkNum, 0 /* first seg */ ); - } - else - ret = 0; + /* Prevent other backends' fds from holding on to the disk space */ + ret = do_truncate(path); - /* Next unlink the file, unless it was already found to be missing */ - if (ret == 0 || errno != ENOENT) - { - ret = unlink(path); - if (ret < 0 && errno != ENOENT) - ereport(WARNING, - (errcode_for_file_access(), - errmsg("could not remove file \"%s\": %m", path))); - } + /* Forget any pending sync requests for the first segment */ + register_forget_request(rnode, forkNum, 0 /* first seg */ ); } else - { - /* Prevent other backends' fds from holding on to the disk space */ - ret = do_truncate(path); + ret = 0; - /* Register request to unlink first segment later */ - register_unlink_segment(rnode, forkNum, 0 /* first seg */ ); + /* Next unlink the file, unless it was already found to be missing */ + if (ret == 0 || errno != ENOENT) + { + ret = unlink(path); + if (ret < 0 && errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", path))); } /* @@ -640,7 +627,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum, reln->smgr_rnode.node.spcNode, reln->smgr_rnode.node.dbNode, - reln->smgr_rnode.node.relNode, + RELFILENODE_GETRELNODE(reln->smgr_rnode.node), reln->smgr_rnode.backend); v = _mdfd_getseg(reln, forknum, blocknum, false, @@ -655,7 +642,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum, reln->smgr_rnode.node.spcNode, reln->smgr_rnode.node.dbNode, -
reln->smgr_rnode.node.relNode, + RELFILENODE_GETRELNODE(reln->smgr_rnode.node), reln->smgr_rnode.backend, nbytes, BLCKSZ); @@ -710,7 +697,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, TRACE_POSTGRESQL_SMGR_MD_WRITE_START(forknum, blocknum, reln->smgr_rnode.node.spcNode, reln->smgr_rnode.node.dbNode, - reln->smgr_rnode.node.relNode, + RELFILENODE_GETRELNODE(reln->smgr_rnode.node), reln->smgr_rnode.backend); v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, @@ -725,7 +712,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum, reln->smgr_rnode.node.spcNode, reln->smgr_rnode.node.dbNode, - reln->smgr_rnode.node.relNode, + RELFILENODE_GETRELNODE(reln->smgr_rnode.node), reln->smgr_rnode.backend, nbytes, BLCKSZ); @@ -995,23 +982,6 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg) } /* - * register_unlink_segment() -- Schedule a file to be deleted after next checkpoint - */ -static void -register_unlink_segment(RelFileNodeBackend rnode, ForkNumber forknum, - BlockNumber segno) -{ - FileTag tag; - - INIT_MD_FILETAG(tag, rnode.node, forknum, segno); - - /* Should never be used with temp relations */ - Assert(!RelFileNodeBackendIsTemp(rnode)); - - RegisterSyncRequest(&tag, SYNC_UNLINK_REQUEST, true /* retryOnError */ ); -} - -/* * register_forget_request() -- forget any fsyncs for a relation fork's segment */ static void @@ -1036,7 +1006,7 @@ ForgetDatabaseSyncRequests(Oid dbid) rnode.dbNode = dbid; rnode.spcNode = 0; - rnode.relNode = 0; + RELFILENODE_SETRELNODE(rnode, 0); INIT_MD_FILETAG(tag, rnode, InvalidForkNumber, InvalidBlockNumber); diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c index 543f691..46a1242 100644 --- a/src/backend/storage/sync/sync.c +++ b/src/backend/storage/sync/sync.c @@ -188,92 +188,6 @@ SyncPreCheckpoint(void) } /* - * SyncPostCheckpoint() -- Do post-checkpoint work - * - * Remove any
lingering files that can now be safely removed. - */ -void -SyncPostCheckpoint(void) -{ - int absorb_counter; - ListCell *lc; - - absorb_counter = UNLINKS_PER_ABSORB; - foreach(lc, pendingUnlinks) - { - PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(lc); - char path[MAXPGPATH]; - - /* Skip over any canceled entries */ - if (entry->canceled) - continue; - - /* - * New entries are appended to the end, so if the entry is new we've - * reached the end of old entries. - * - * Note: if just the right number of consecutive checkpoints fail, we - * could be fooled here by cycle_ctr wraparound. However, the only - * consequence is that we'd delay unlinking for one more checkpoint, - * which is perfectly tolerable. - */ - if (entry->cycle_ctr == checkpoint_cycle_ctr) - break; - - /* Unlink the file */ - if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag, - path) < 0) - { - /* - * There's a race condition, when the database is dropped at the - * same time that we process the pending unlink requests. If the - * DROP DATABASE deletes the file before we do, we will get ENOENT - * here. rmtree() also has to ignore ENOENT errors, to deal with - * the possibility that we delete the file first. - */ - if (errno != ENOENT) - ereport(WARNING, - (errcode_for_file_access(), - errmsg("could not remove file \"%s\": %m", path))); - } - - /* Mark the list entry as canceled, just in case */ - entry->canceled = true; - - /* - * As in ProcessSyncRequests, we don't want to stop absorbing fsync - * requests for a long time when there are many deletions to be done. - * We can safely call AbsorbSyncRequests() at this point in the loop. - */ - if (--absorb_counter <= 0) - { - AbsorbSyncRequests(); - absorb_counter = UNLINKS_PER_ABSORB; - } - } - - /* - * If we reached the end of the list, we can just remove the whole list - * (remembering to pfree all the PendingUnlinkEntry objects). Otherwise, - * we must keep the entries at or after "lc". 
- */ - if (lc == NULL) - { - list_free_deep(pendingUnlinks); - pendingUnlinks = NIL; - } - else - { - int ntodelete = list_cell_number(pendingUnlinks, lc); - - for (int i = 0; i < ntodelete; i++) - pfree(list_nth(pendingUnlinks, i)); - - pendingUnlinks = list_delete_first_n(pendingUnlinks, ntodelete); - } -} - -/* * ProcessSyncRequests() -- Process queued fsync requests. */ void @@ -519,21 +433,6 @@ RememberSyncRequest(const FileTag *ftag, SyncRequestType type) entry->canceled = true; } } - else if (type == SYNC_UNLINK_REQUEST) - { - /* Unlink request: put it in the linked list */ - MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt); - PendingUnlinkEntry *entry; - - entry = palloc(sizeof(PendingUnlinkEntry)); - entry->tag = *ftag; - entry->cycle_ctr = checkpoint_cycle_ctr; - entry->canceled = false; - - pendingUnlinks = lappend(pendingUnlinks, entry); - - MemoryContextSwitchTo(oldcxt); - } else { /* Normal case: enter a request to fsync this segment */ diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c index 3a2f2e1..dc8bc52 100644 --- a/src/backend/utils/adt/dbsize.c +++ b/src/backend/utils/adt/dbsize.c @@ -945,21 +945,21 @@ pg_relation_filepath(PG_FUNCTION_ARGS) else rnode.dbNode = MyDatabaseId; if (relform->relfilenode) - rnode.relNode = relform->relfilenode; + RELFILENODE_SETRELNODE(rnode, relform->relfilenode); else /* Consult the relation mapper */ - rnode.relNode = RelationMapOidToFilenode(relid, - relform->relisshared); + RELFILENODE_SETRELNODE(rnode, RelationMapOidToFilenode(relid, + relform->relisshared)); } else { /* no storage, return NULL */ - rnode.relNode = InvalidOid; + RELFILENODE_SETRELNODE(rnode, InvalidRelfileNode); /* some compilers generate warnings without these next two lines */ rnode.dbNode = InvalidOid; rnode.spcNode = InvalidOid; } - if (!OidIsValid(rnode.relNode)) + if (RELFILENODE_GETRELNODE(rnode) == InvalidRelfileNode) { ReleaseSysCache(tuple); PG_RETURN_NULL(); diff --git 
a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 2e760e8..f0196c8 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -1288,7 +1288,7 @@ retry: static void RelationInitPhysicalAddr(Relation relation) { - Oid oldnode = relation->rd_node.relNode; + Oid oldnode = RELFILENODE_GETRELNODE(relation->rd_node); /* these relations kinds never have storage */ if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind)) @@ -1335,15 +1335,16 @@ RelationInitPhysicalAddr(Relation relation) heap_freetuple(phys_tuple); } - relation->rd_node.relNode = relation->rd_rel->relfilenode; + RELFILENODE_SETRELNODE(relation->rd_node, + relation->rd_rel->relfilenode); } else { /* Consult the relation mapper */ - relation->rd_node.relNode = - RelationMapOidToFilenode(relation->rd_id, - relation->rd_rel->relisshared); - if (!OidIsValid(relation->rd_node.relNode)) + RELFILENODE_SETRELNODE(relation->rd_node, + RelationMapOidToFilenode(relation->rd_id, + relation->rd_rel->relisshared)); + if (RELFILENODE_GETRELNODE(relation->rd_node) == InvalidRelfileNode) elog(ERROR, "could not find relation mapping for relation \"%s\", OID %u", RelationGetRelationName(relation), relation->rd_id); } @@ -1353,7 +1354,8 @@ RelationInitPhysicalAddr(Relation relation) * rd_firstRelfilenodeSubid. No subtransactions start or end while in * parallel mode, so the specific SubTransactionId does not matter. */ - if (IsParallelWorker() && oldnode != relation->rd_node.relNode) + if (IsParallelWorker() && oldnode != + RELFILENODE_GETRELNODE(relation->rd_node)) { if (RelFileNodeSkippingWAL(relation->rd_node)) relation->rd_firstRelfilenodeSubid = TopSubTransactionId; @@ -1958,13 +1960,14 @@ formrdesc(const char *relationName, Oid relationReltype, /* * All relations made with formrdesc are mapped. This is necessarily so * because there is no other way to know what filenode they currently - * have. 
In bootstrap mode, add them to the initial relation mapper data, - * specifying that the initial filenode is the same as the OID. + * have. In bootstrap mode, add them to the initial relation mapper data. + * + * TODO: Is it right to allocate new relnode here?? */ relation->rd_rel->relfilenode = InvalidOid; if (IsBootstrapProcessingMode()) RelationMapUpdateMap(RelationGetRelid(relation), - RelationGetRelid(relation), + GetNewRelNode(), isshared, true); /* @@ -3673,7 +3676,7 @@ RelationBuildLocalRelation(const char *relname, void RelationSetNewRelfilenode(Relation relation, char persistence) { - Oid newrelfilenode; + RelNode newrelfilenode; Relation pg_class; HeapTuple tuple; Form_pg_class classform; @@ -3682,7 +3685,7 @@ RelationSetNewRelfilenode(Relation relation, char persistence) RelFileNode newrnode; /* Allocate a new relfilenode */ - newrelfilenode = GetNewRelFileNode(relation->rd_rel->reltablespace, NULL, + newrelfilenode = GetNewRelFileNode(relation->rd_rel->reltablespace, persistence); /* @@ -3711,7 +3714,8 @@ RelationSetNewRelfilenode(Relation relation, char persistence) * caught here, if GetNewRelFileNode messes up for any reason. 
*/ newrnode = relation->rd_node; - newrnode.relNode = newrelfilenode; + RELFILENODE_SETRELNODE(newrnode, newrelfilenode); + if (RELKIND_HAS_TABLE_AM(relation->rd_rel->relkind)) { diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c index 4f6811f..740753d 100644 --- a/src/backend/utils/cache/relmapper.c +++ b/src/backend/utils/cache/relmapper.c @@ -79,7 +79,7 @@ typedef struct RelMapping { Oid mapoid; /* OID of a catalog */ - Oid mapfilenode; /* its filenode number */ + RelNodeId mapfilenode; /* its filenode number */ } RelMapping; typedef struct RelMapFile @@ -132,7 +132,7 @@ static RelMapFile pending_local_updates; /* non-export function prototypes */ -static void apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode, +static void apply_map_update(RelMapFile *map, Oid relationId, uint64 fileNode, bool add_okay); static void merge_map_updates(RelMapFile *map, const RelMapFile *updates, bool add_okay); @@ -155,7 +155,7 @@ static void perform_relmap_update(bool shared, const RelMapFile *updates); * Returns InvalidOid if the OID is not known (which should never happen, * but the caller is in a better position to report a meaningful error). 
*/ -Oid +RelNode RelationMapOidToFilenode(Oid relationId, bool shared) { const RelMapFile *map; @@ -168,13 +168,13 @@ RelationMapOidToFilenode(Oid relationId, bool shared) for (i = 0; i < map->num_mappings; i++) { if (relationId == map->mappings[i].mapoid) - return map->mappings[i].mapfilenode; + return RELNODEID_GET_RELNODE(map->mappings[i].mapfilenode); } map = &shared_map; for (i = 0; i < map->num_mappings; i++) { if (relationId == map->mappings[i].mapoid) - return map->mappings[i].mapfilenode; + return RELNODEID_GET_RELNODE(map->mappings[i].mapfilenode); } } else @@ -183,17 +183,17 @@ RelationMapOidToFilenode(Oid relationId, bool shared) for (i = 0; i < map->num_mappings; i++) { if (relationId == map->mappings[i].mapoid) - return map->mappings[i].mapfilenode; + return RELNODEID_GET_RELNODE(map->mappings[i].mapfilenode); } map = &local_map; for (i = 0; i < map->num_mappings; i++) { if (relationId == map->mappings[i].mapoid) - return map->mappings[i].mapfilenode; + return RELNODEID_GET_RELNODE(map->mappings[i].mapfilenode); } } - return InvalidOid; + return InvalidRelfileNode; } /* @@ -209,7 +209,7 @@ RelationMapOidToFilenode(Oid relationId, bool shared) * relfilenode doesn't pertain to a mapped relation. 
*/ Oid -RelationMapFilenodeToOid(Oid filenode, bool shared) +RelationMapFilenodeToOid(RelNode filenode, bool shared) { const RelMapFile *map; int32 i; @@ -220,13 +220,13 @@ RelationMapFilenodeToOid(Oid filenode, bool shared) map = &active_shared_updates; for (i = 0; i < map->num_mappings; i++) { - if (filenode == map->mappings[i].mapfilenode) + if (filenode == RELNODEID_GET_RELNODE(map->mappings[i].mapfilenode)) return map->mappings[i].mapoid; } map = &shared_map; for (i = 0; i < map->num_mappings; i++) { - if (filenode == map->mappings[i].mapfilenode) + if (filenode == RELNODEID_GET_RELNODE(map->mappings[i].mapfilenode)) return map->mappings[i].mapoid; } } @@ -235,13 +235,13 @@ RelationMapFilenodeToOid(Oid filenode, bool shared) map = &active_local_updates; for (i = 0; i < map->num_mappings; i++) { - if (filenode == map->mappings[i].mapfilenode) + if (filenode == RELNODEID_GET_RELNODE(map->mappings[i].mapfilenode)) return map->mappings[i].mapoid; } map = &local_map; for (i = 0; i < map->num_mappings; i++) { - if (filenode == map->mappings[i].mapfilenode) + if (filenode == RELNODEID_GET_RELNODE(map->mappings[i].mapfilenode)) return map->mappings[i].mapoid; } } @@ -258,7 +258,7 @@ RelationMapFilenodeToOid(Oid filenode, bool shared) * immediately. Otherwise it is made pending until CommandCounterIncrement. */ void -RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared, +RelationMapUpdateMap(Oid relationId, RelNode fileNode, bool shared, bool immediate) { RelMapFile *map; @@ -316,7 +316,8 @@ RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared, * add_okay = false to draw an error if not. 
*/ static void -apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode, bool add_okay) +apply_map_update(RelMapFile *map, Oid relationId, RelNode fileNode, + bool add_okay) { int32 i; @@ -325,7 +326,7 @@ apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode, bool add_okay) { if (relationId == map->mappings[i].mapoid) { - map->mappings[i].mapfilenode = fileNode; + RELNODEID_SET_RELNODE(map->mappings[i].mapfilenode, fileNode); return; } } @@ -337,7 +338,8 @@ apply_map_update(RelMapFile *map, Oid relationId, Oid fileNode, bool add_okay) if (map->num_mappings >= MAX_MAPPINGS) elog(ERROR, "ran out of space in relation map"); map->mappings[map->num_mappings].mapoid = relationId; - map->mappings[map->num_mappings].mapfilenode = fileNode; + RELNODEID_SET_RELNODE(map->mappings[map->num_mappings].mapfilenode, + fileNode); map->num_mappings++; } @@ -356,7 +358,8 @@ merge_map_updates(RelMapFile *map, const RelMapFile *updates, bool add_okay) { apply_map_update(map, updates->mappings[i].mapoid, - updates->mappings[i].mapfilenode, + RELNODEID_GET_RELNODE( + updates->mappings[i].mapfilenode), add_okay); } } diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c index 781f8b8..85ed88c 100644 --- a/src/backend/utils/misc/pg_controldata.c +++ b/src/backend/utils/misc/pg_controldata.c @@ -79,8 +79,8 @@ pg_control_system(PG_FUNCTION_ARGS) Datum pg_control_checkpoint(PG_FUNCTION_ARGS) { - Datum values[18]; - bool nulls[18]; + Datum values[19]; + bool nulls[19]; TupleDesc tupdesc; HeapTuple htup; ControlFileData *ControlFile; @@ -129,6 +129,8 @@ pg_control_checkpoint(PG_FUNCTION_ARGS) XIDOID, -1, 0); TupleDescInitEntry(tupdesc, (AttrNumber) 18, "checkpoint_time", TIMESTAMPTZOID, -1, 0); + TupleDescInitEntry(tupdesc, (AttrNumber) 19, "next_relfilenode", + INT8OID, -1, 0); tupdesc = BlessTupleDesc(tupdesc); /* Read the control file. 
*/ @@ -202,6 +204,9 @@ pg_control_checkpoint(PG_FUNCTION_ARGS) values[17] = TimestampTzGetDatum(time_t_to_timestamptz(ControlFile->checkPointCopy.time)); nulls[17] = false; + values[18] = Int64GetDatum(ControlFile->checkPointCopy.nextRelNode); + nulls[18] = false; + htup = heap_form_tuple(tupdesc, values, nulls); PG_RETURN_DATUM(HeapTupleGetDatum(htup)); diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c index f911f98..4f14a1b 100644 --- a/src/bin/pg_controldata/pg_controldata.c +++ b/src/bin/pg_controldata/pg_controldata.c @@ -250,6 +250,8 @@ main(int argc, char *argv[]) printf(_("Latest checkpoint's NextXID: %u:%u\n"), EpochFromFullTransactionId(ControlFile->checkPointCopy.nextXid), XidFromFullTransactionId(ControlFile->checkPointCopy.nextXid)); + printf(_("Latest checkpoint's NextRelFileNode: " UINT64_FORMAT "\n"), + ControlFile->checkPointCopy.nextRelNode); printf(_("Latest checkpoint's NextOID: %u\n"), ControlFile->checkPointCopy.nextOid); printf(_("Latest checkpoint's NextMultiXactId: %u\n"), diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c index 7211090..7d626b7 100644 --- a/src/bin/pg_rewind/filemap.c +++ b/src/bin/pg_rewind/filemap.c @@ -512,6 +512,7 @@ isRelDataFile(const char *path) RelFileNode rnode; unsigned int segNo; int nmatch; + uint64 relNode; bool matched; /*---- @@ -535,11 +536,12 @@ isRelDataFile(const char *path) */ rnode.spcNode = InvalidOid; rnode.dbNode = InvalidOid; - rnode.relNode = InvalidOid; + RELFILENODE_SETRELNODE(rnode, InvalidRelfileNode); segNo = 0; matched = false; - nmatch = sscanf(path, "global/%u.%u", &rnode.relNode, &segNo); + nmatch = sscanf(path, "global/" UINT64_FORMAT ".%u", &relNode, &segNo); + RELFILENODE_SETRELNODE(rnode, relNode); if (nmatch == 1 || nmatch == 2) { rnode.spcNode = GLOBALTABLESPACE_OID; @@ -548,8 +550,9 @@ isRelDataFile(const char *path) } else { - nmatch = sscanf(path, "base/%u/%u.%u", - &rnode.dbNode, &rnode.relNode, &segNo); + nmatch = 
sscanf(path, "base/%u/" UINT64_FORMAT ".%u", + &rnode.dbNode, &relNode, &segNo); + RELFILENODE_SETRELNODE(rnode, relNode); if (nmatch == 2 || nmatch == 3) { rnode.spcNode = DEFAULTTABLESPACE_OID; @@ -557,9 +560,10 @@ else { - nmatch = sscanf(path, "pg_tblspc/%u/" TABLESPACE_VERSION_DIRECTORY "/%u/%u.%u", - &rnode.spcNode, &rnode.dbNode, &rnode.relNode, + nmatch = sscanf(path, "pg_tblspc/%u/" TABLESPACE_VERSION_DIRECTORY "/%u/" UINT64_FORMAT ".%u", + &rnode.spcNode, &rnode.dbNode, &relNode, &segNo); + RELFILENODE_SETRELNODE(rnode, relNode); if (nmatch == 3 || nmatch == 4) matched = true; } diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c index a6251e1..e88cfdf 100644 --- a/src/bin/pg_waldump/pg_waldump.c +++ b/src/bin/pg_waldump/pg_waldump.c @@ -518,15 +518,17 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record) XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blk); if (forknum != MAIN_FORKNUM) - printf(", blkref #%d: rel %u/%u/%u fork %s blk %u", + printf(", blkref #%d: rel %u/%u/" UINT64_FORMAT " fork %s blk %u", block_id, - rnode.spcNode, rnode.dbNode, rnode.relNode, + rnode.spcNode, rnode.dbNode, + RELFILENODE_GETRELNODE(rnode), forkNames[forknum], blk); else - printf(", blkref #%d: rel %u/%u/%u blk %u", + printf(", blkref #%d: rel %u/%u/" UINT64_FORMAT " blk %u", block_id, - rnode.spcNode, rnode.dbNode, rnode.relNode, + rnode.spcNode, rnode.dbNode, + RELFILENODE_GETRELNODE(rnode), blk); if (XLogRecHasBlockImage(record, block_id)) { @@ -548,9 +550,9 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record) continue; XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blk); - printf("\tblkref #%d: rel %u/%u/%u fork %s blk %u", + printf("\tblkref #%d: rel %u/%u/" UINT64_FORMAT " fork %s blk %u", block_id, - rnode.spcNode, rnode.dbNode, rnode.relNode, + rnode.spcNode, rnode.dbNode, RELFILENODE_GETRELNODE(rnode), forkNames[forknum], blk); if
(XLogRecHasBlockImage(record, block_id)) diff --git a/src/common/relpath.c b/src/common/relpath.c index 636c96e..0a458d8 100644 --- a/src/common/relpath.c +++ b/src/common/relpath.c @@ -138,7 +138,7 @@ GetDatabasePath(Oid dbNode, Oid spcNode) * the trouble considering BackendId is just int anyway. */ char * -GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, +GetRelationPath(Oid dbNode, Oid spcNode, uint64 relNode, int backendId, ForkNumber forkNumber) { char *path; @@ -149,10 +149,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, Assert(dbNode == 0); Assert(backendId == InvalidBackendId); if (forkNumber != MAIN_FORKNUM) - path = psprintf("global/%u_%s", + path = psprintf("global/" UINT64_FORMAT "_%s", relNode, forkNames[forkNumber]); else - path = psprintf("global/%u", relNode); + path = psprintf("global/" UINT64_FORMAT, relNode); } else if (spcNode == DEFAULTTABLESPACE_OID) { @@ -160,21 +160,21 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, if (backendId == InvalidBackendId) { if (forkNumber != MAIN_FORKNUM) - path = psprintf("base/%u/%u_%s", + path = psprintf("base/%u/" UINT64_FORMAT "_%s", dbNode, relNode, forkNames[forkNumber]); else - path = psprintf("base/%u/%u", + path = psprintf("base/%u/" UINT64_FORMAT, dbNode, relNode); } else { if (forkNumber != MAIN_FORKNUM) - path = psprintf("base/%u/t%d_%u_%s", + path = psprintf("base/%u/t%d_" UINT64_FORMAT "_%s", dbNode, backendId, relNode, forkNames[forkNumber]); else - path = psprintf("base/%u/t%d_%u", + path = psprintf("base/%u/t%d_" UINT64_FORMAT, dbNode, backendId, relNode); } } @@ -184,24 +184,24 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, if (backendId == InvalidBackendId) { if (forkNumber != MAIN_FORKNUM) - path = psprintf("pg_tblspc/%u/%s/%u/%u_%s", + path = psprintf("pg_tblspc/%u/%s/%u/" UINT64_FORMAT "_%s", spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, relNode, forkNames[forkNumber]); else - path = psprintf("pg_tblspc/%u/%s/%u/%u", + path = 
psprintf("pg_tblspc/%u/%s/%u/" UINT64_FORMAT, spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, relNode); } else { if (forkNumber != MAIN_FORKNUM) - path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s", + path = psprintf("pg_tblspc/%u/%s/%u/t%d_" UINT64_FORMAT "_%s", spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, backendId, relNode, forkNames[forkNumber]); else - path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u", + path = psprintf("pg_tblspc/%u/%s/%u/t%d_" UINT64_FORMAT, spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, backendId, relNode); } diff --git a/src/include/access/transam.h b/src/include/access/transam.h index 9a2816d..2e68920 100644 --- a/src/include/access/transam.h +++ b/src/include/access/transam.h @@ -15,6 +15,7 @@ #define TRANSAM_H #include "access/xlogdefs.h" +#include "storage/relfilenode.h" /* ---------------- @@ -195,6 +196,7 @@ FullTransactionIdAdvance(FullTransactionId *dest) #define FirstGenbkiObjectId 10000 #define FirstUnpinnedObjectId 12000 #define FirstNormalObjectId 16384 +#define FirstNormalRelfileNode 1 /* OIDs of Template0 and Postgres database are fixed */ #define Template0ObjectId 4 @@ -217,6 +219,9 @@ typedef struct VariableCacheData */ Oid nextOid; /* next OID to assign */ uint32 oidCount; /* OIDs available before must do XLOG work */ + RelNode nextRelNode; /* next relfilenode to assign */ + uint32 relnodecount; /* Relfilenode available before must do XLOG + work */ /* * These fields are protected by XidGenLock. 
@@ -298,6 +303,7 @@ extern void SetTransactionIdLimit(TransactionId oldest_datfrozenxid, extern void AdvanceOldestClogXid(TransactionId oldest_datfrozenxid); extern bool ForceTransactionIdLimitUpdate(void); extern Oid GetNewObjectId(void); +extern RelNode GetNewRelNode(void); extern void StopGeneratingPinnedObjectIds(void); #ifdef USE_ASSERT_CHECKING diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index bb0c526..04f0cd6 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -304,6 +304,7 @@ extern bool CreateRestartPoint(int flags); extern WALAvailability GetWALAvailability(XLogRecPtr targetLSN); extern XLogRecPtr CalculateMaxmumSafeLSN(void); extern void XLogPutNextOid(Oid nextOid); +extern void XLogPutNextRelFileNode(RelNode nextrelnode); extern XLogRecPtr XLogRestorePoint(const char *rpName); extern void UpdateFullPageWrites(void); extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p); diff --git a/src/include/catalog/catalog.h b/src/include/catalog/catalog.h index 60c1215..1b83c79 100644 --- a/src/include/catalog/catalog.h +++ b/src/include/catalog/catalog.h @@ -15,6 +15,7 @@ #define CATALOG_H #include "catalog/pg_class.h" +#include "storage/relfilenode.h" #include "utils/relcache.h" @@ -38,7 +39,6 @@ extern bool IsPinnedObject(Oid classId, Oid objectId); extern Oid GetNewOidWithIndex(Relation relation, Oid indexId, AttrNumber oidcolumn); -extern Oid GetNewRelFileNode(Oid reltablespace, Relation pg_class, - char relpersistence); +extern RelNode GetNewRelFileNode(Oid reltablespace, char relpersistence); #endif /* CATALOG_H */ diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h index 304e8c1..4659ed3 100644 --- a/src/include/catalog/pg_class.h +++ b/src/include/catalog/pg_class.h @@ -52,13 +52,13 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat /* access method; 0 if not a table / index */ Oid relam BKI_DEFAULT(heap) BKI_LOOKUP_OPT(pg_am); 
- /* identifier of physical storage file */ - /* relfilenode == 0 means it is a "mapped" relation, see relmapper.c */ - Oid relfilenode BKI_DEFAULT(0); - /* identifier of table space for relation (0 means default for database) */ Oid reltablespace BKI_DEFAULT(0) BKI_LOOKUP_OPT(pg_tablespace); + /* identifier of physical storage file */ + /* relfilenode == 0 means it is a "mapped" relation, see relmapper.c */ + int64 relfilenode BKI_DEFAULT(0); + /* # of blocks (not always up-to-date) */ int32 relpages BKI_DEFAULT(0); @@ -154,7 +154,7 @@ typedef FormData_pg_class *Form_pg_class; DECLARE_UNIQUE_INDEX_PKEY(pg_class_oid_index, 2662, ClassOidIndexId, on pg_class using btree(oid oid_ops)); DECLARE_UNIQUE_INDEX(pg_class_relname_nsp_index, 2663, ClassNameNspIndexId, on pg_class using btree(relname name_ops, relnamespace oid_ops)); -DECLARE_INDEX(pg_class_tblspc_relfilenode_index, 3455, ClassTblspcRelfilenodeIndexId, on pg_class using btree(reltablespace oid_ops, relfilenode oid_ops)); +DECLARE_INDEX(pg_class_tblspc_relfilenode_index, 3455, ClassTblspcRelfilenodeIndexId, on pg_class using btree(reltablespace oid_ops, relfilenode int8_ops)); #ifdef EXPOSE_TO_CLIENT_CODE diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h index 1f3dc24..27d584d 100644 --- a/src/include/catalog/pg_control.h +++ b/src/include/catalog/pg_control.h @@ -41,6 +41,7 @@ typedef struct CheckPoint * timeline (equals ThisTimeLineID otherwise) */ bool fullPageWrites; /* current full_page_writes */ FullTransactionId nextXid; /* next free transaction ID */ + RelNode nextRelNode; /* next relfile node */ Oid nextOid; /* next free OID */ MultiXactId nextMulti; /* next free MultiXactId */ MultiXactOffset nextMultiOffset; /* next free MultiXact offset */ @@ -78,6 +79,7 @@ typedef struct CheckPoint #define XLOG_FPI 0xB0 /* 0xC0 is used in Postgres 9.5-11 */ #define XLOG_OVERWRITE_CONTRECORD 0xD0 +#define XLOG_NEXT_RELFILENODE 0xE0 /* diff --git a/src/include/commands/tablecmds.h 
b/src/include/commands/tablecmds.h index 5d4037f..167655e 100644 --- a/src/include/commands/tablecmds.h +++ b/src/include/commands/tablecmds.h @@ -66,7 +66,7 @@ extern void SetRelationHasSubclass(Oid relationId, bool relhassubclass); extern bool CheckRelationTableSpaceMove(Relation rel, Oid newTableSpaceId); extern void SetRelationTableSpace(Relation rel, Oid newTableSpaceId, - Oid newRelFileNode); + uint64 newRelFileNode); extern ObjectAddress renameatt(RenameStmt *stmt); diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h index a4b5dc8..3756364 100644 --- a/src/include/common/relpath.h +++ b/src/include/common/relpath.h @@ -66,7 +66,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork); */ extern char *GetDatabasePath(Oid dbNode, Oid spcNode); -extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, +extern char *GetRelationPath(Oid dbNode, Oid spcNode, uint64 relNode, int backendId, ForkNumber forkNumber); /* @@ -76,8 +76,8 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, /* First argument is a RelFileNode */ #define relpathbackend(rnode, backend, forknum) \ - GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \ - backend, forknum) + GetRelationPath((rnode).dbNode, (rnode).spcNode, \ + RELFILENODE_GETRELNODE((rnode)), backend, forknum) /* First argument is a RelFileNode */ #define relpathperm(rnode, forknum) \ diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h index b903d2b..293dc90 100644 --- a/src/include/storage/buf_internals.h +++ b/src/include/storage/buf_internals.h @@ -21,6 +21,7 @@ #include "storage/condition_variable.h" #include "storage/latch.h" #include "storage/lwlock.h" +#include "storage/relfilenode.h" #include "storage/shmem.h" #include "storage/smgr.h" #include "storage/spin.h" @@ -91,7 +92,6 @@ typedef struct buftag { RelFileNode rnode; /* physical relation identifier */ - ForkNumber forkNum; BlockNumber blockNum; /* blknum 
relative to begin of reln */ } BufferTag; @@ -99,23 +99,23 @@ typedef struct buftag ( \ (a).rnode.spcNode = InvalidOid, \ (a).rnode.dbNode = InvalidOid, \ - (a).rnode.relNode = InvalidOid, \ - (a).forkNum = InvalidForkNumber, \ + RELFILENODE_SETRELNODE((a).rnode, 0), \ + RELFILENODE_SETFORKNUM((a).rnode, InvalidForkNumber), \ (a).blockNum = InvalidBlockNumber \ ) #define INIT_BUFFERTAG(a,xx_rnode,xx_forkNum,xx_blockNum) \ ( \ (a).rnode = (xx_rnode), \ - (a).forkNum = (xx_forkNum), \ - (a).blockNum = (xx_blockNum) \ + (a).blockNum = (xx_blockNum), \ + RELFILENODE_SETFORKNUM((a).rnode, (xx_forkNum)) \ ) #define BUFFERTAGS_EQUAL(a,b) \ ( \ RelFileNodeEquals((a).rnode, (b).rnode) && \ (a).blockNum == (b).blockNum && \ - (a).forkNum == (b).forkNum \ + RELFILENODE_GETFORKNUM((a).rnode) == RELFILENODE_GETFORKNUM((b).rnode) \ ) /* diff --git a/src/include/storage/relfilenode.h b/src/include/storage/relfilenode.h index 4fdc606..57b1c2c 100644 --- a/src/include/storage/relfilenode.h +++ b/src/include/storage/relfilenode.h @@ -17,6 +17,27 @@ #include "common/relpath.h" #include "storage/backendid.h" +/* FIXME: where to keep this typedef. */ +typedef uint64 RelNode; + +#ifdef __cplusplus +#define InvalidRelfileNode (RelNode(0)) +#else +#define InvalidRelfileNode ((RelNode) 0) +#endif + +/* + * RelNodeId: + * + * this is a storage type for RelNode. The reasoning behind using this is same + * as using the BlockId so refer comment atop BlockId. + */ +typedef struct RelNodeId +{ + uint32 rn_hi; + uint32 rn_lo; +} RelNodeId; + /* * RelFileNode must provide all that we need to know to physically access * a relation, with the exception of the backend ID, which can be provided @@ -58,7 +79,7 @@ typedef struct RelFileNode { Oid spcNode; /* tablespace */ Oid dbNode; /* database */ - Oid relNode; /* relation */ + RelNodeId relNode; /* relation */ } RelFileNode; /* @@ -86,14 +107,53 @@ typedef struct RelFileNodeBackend * RelFileNodeBackendEquals. 
*/ #define RelFileNodeEquals(node1, node2) \ - ((node1).relNode == (node2).relNode && \ + ((RELFILENODE_GETRELNODE((node1)) == RELFILENODE_GETRELNODE((node2))) && \ (node1).dbNode == (node2).dbNode && \ (node1).spcNode == (node2).spcNode) #define RelFileNodeBackendEquals(node1, node2) \ - ((node1).node.relNode == (node2).node.relNode && \ + (RELFILENODE_GETRELNODE((node1)) == RELFILENODE_GETRELNODE((node2)) && \ (node1).node.dbNode == (node2).node.dbNode && \ (node1).backend == (node2).backend && \ (node1).node.spcNode == (node2).node.spcNode) +/* + * These macros define the relNode filed of the RelFileNode, 8 high order bita + * defines the fork no and remaining 56 bits define the relfilenode. + */ +#define RELFILENODE_RELNODE_BITS 56 +#define RELFILENODE_RELNODE_MASK ((((uint64) 1) << RELFILENODE_RELNODE_BITS) - 1) +#define RELFILENODE_RELNODE_MASK1 ((((uint32) 1) << 24) - 1) + +/* Getting and setitng RelNode from RelNodeId. */ +#define RELNODEID_GET_RELNODE(rnode) \ + (uint64) (((uint64) (rnode).rn_hi << 32) | ((uint32) (rnode).rn_lo)) + +#define RELNODEID_SET_RELNODE(rnode, val) \ +( \ + (rnode).rn_hi = (val) >> 32, \ + (rnode).rn_lo = (val) & 0xffffffff \ +) + +/* + * Macros to get and set the relNode and forkNum inside RelFileNode.relNode. + */ +#define RELFILENODE_GETRELNODE(rnode) \ + (RELNODEID_GET_RELNODE((rnode).relNode) & RELFILENODE_RELNODE_MASK) + +#define RELFILENODE_GETFORKNUM(rnode) \ + (RELNODEID_GET_RELNODE((rnode).relNode) >> RELFILENODE_RELNODE_BITS) + +#define RELFILENODE_SETRELNODE(rnode, val) \ + RELNODEID_SET_RELNODE((rnode).relNode, (val) & RELFILENODE_RELNODE_MASK) + +#define RELFILENODE_SETFORKNUM(rnode, forkNum) \ + RELNODEID_SET_RELNODE((rnode).relNode, \ + (RELNODEID_GET_RELNODE((rnode).relNode)) | \ + ((uint64) (forkNum) << RELFILENODE_RELNODE_BITS)) + +/* Clear fork number from RelFileNode.relNode. 
*/ +#define RELFILENODE_CLEARFORKNUM(rnode) \ + RELFILENODE_SETRELNODE(rnode, RELFILENODE_GETRELNODE(rnode)) + #endif /* RELFILENODE_H */ diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h index 9737e1e..4d67850 100644 --- a/src/include/storage/sync.h +++ b/src/include/storage/sync.h @@ -57,7 +57,6 @@ typedef struct FileTag extern void InitSync(void); extern void SyncPreCheckpoint(void); -extern void SyncPostCheckpoint(void); extern void ProcessSyncRequests(void); extern void RememberSyncRequest(const FileTag *ftag, SyncRequestType type); extern bool RegisterSyncRequest(const FileTag *ftag, SyncRequestType type, diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h index 9fbb5a7..58234a8 100644 --- a/src/include/utils/relmapper.h +++ b/src/include/utils/relmapper.h @@ -35,11 +35,11 @@ typedef struct xl_relmap_update #define MinSizeOfRelmapUpdate offsetof(xl_relmap_update, data) -extern Oid RelationMapOidToFilenode(Oid relationId, bool shared); +extern RelNode RelationMapOidToFilenode(Oid relationId, bool shared); -extern Oid RelationMapFilenodeToOid(Oid relationId, bool shared); +extern Oid RelationMapFilenodeToOid(RelNode relationId, bool shared); -extern void RelationMapUpdateMap(Oid relationId, Oid fileNode, bool shared, +extern void RelationMapUpdateMap(Oid relationId, RelNode fileNode, bool shared, bool immediate); extern void RelationMapRemoveMapping(Oid relationId); diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out index 16e0475..3de3b1c 100644 --- a/src/test/regress/expected/alter_table.out +++ b/src/test/regress/expected/alter_table.out @@ -2175,10 +2175,10 @@ select relname, relname | orig_oid | storage | desc ------------------------------+----------+---------+--------------- at_partitioned | t | none | - at_partitioned_0 | t | own | - at_partitioned_0_id_name_key | t | own | child 0 index - at_partitioned_1 | t | own | - at_partitioned_1_id_name_key | t | own | child 
1 index + at_partitioned_0 | t | orig | + at_partitioned_0_id_name_key | t | orig | child 0 index + at_partitioned_1 | t | orig | + at_partitioned_1_id_name_key | t | orig | child 1 index at_partitioned_id_name_key | t | none | parent index (6 rows) @@ -2209,10 +2209,10 @@ select relname, relname | orig_oid | storage | desc ------------------------------+----------+---------+-------------- at_partitioned | t | none | - at_partitioned_0 | t | own | - at_partitioned_0_id_name_key | f | own | parent index - at_partitioned_1 | t | own | - at_partitioned_1_id_name_key | f | own | parent index + at_partitioned_0 | t | orig | + at_partitioned_0_id_name_key | f | OTHER | parent index + at_partitioned_1 | t | orig | + at_partitioned_1_id_name_key | f | OTHER | parent index at_partitioned_id_name_key | f | none | parent index (6 rows) -- 1.8.3.1