Hello,

As per the earlier discussions, I've attached the updated patch for the WAL
consistency check feature. This is how the patch works:

- If WAL consistency check is enabled for a rmgrID, we always include the
  backup image in the WAL record.
- I've extended the RmgrTable with a new function pointer,
  rm_checkConsistency, which is called after rm_redo (only when WAL
  consistency check is enabled for this rmgrID).
- In each rm_checkConsistency, both backup pages and buffer pages are masked
  accordingly before any comparison.
- In postgresql.conf, a new GUC variable named 'wal_consistency' is added.
  The default value of this variable is 'None'. Valid values are
  combinations of Heap2, Heap, Btree, Hash, Gin, Gist, Sequence, SPGist,
  BRIN, Generic and XLOG. It can also be set to 'All' to enable all of them.
- In the recovery tests (src/test/recovery/t), I've added the wal_consistency
  parameter to the existing scripts. This feature doesn't change the
  expected output. If there is any inconsistency, it can be verified in the
  corresponding log file.

Results
------------------------

I've tested with installcheck and installcheck-world in a master-standby
setup. The configuration parameters are as follows.

Master:
wal_level = replica
max_wal_senders = 3
wal_keep_segments = 4000
hot_standby = on
wal_consistency = 'All'

Standby:
wal_consistency = 'All'

I got two types of inconsistencies:

1. For Btree/UNLINK_PAGE_META, btpo_flags are different. In the backup page,
   both the BTP_DELETED and BTP_LEAF flags are set, whereas after redo only
   the BTP_DELETED flag is set in the buffer page. I assume that we should
   clear all btpo_flags before setting BTP_DELETED in
   _bt_unlink_halfdead_page().
2. For BRIN/UPDATE+INIT, block numbers (in rm_tid[0]) are different in the
   REVMAP page. This happens only in two cases. I'm not sure what the
   reason can be.

I haven't done sufficient tests yet to measure the overhead of this
modification. I'll do that next.

Thanks to Amit Kapila, Dilip Kumar and Robert Haas for their off-line
suggestions.

Thoughts?
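
To make the mask-then-compare step above concrete, here is a minimal
standalone sketch. It is not the code from the patch: DEMO_BLCKSZ,
DemoPageHeader and all demo_* functions are hypothetical stand-ins for
BLCKSZ, the real page header and the mask_*_page()/comparePages() routines,
and only the unused space between the two header offsets is masked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLCKSZ 8192		/* stand-in for BLCKSZ; an assumption */

/* Hypothetical, stripped-down page header: just the two offsets we need. */
typedef struct DemoPageHeader
{
	uint16_t	lower;			/* end of the line-pointer array */
	uint16_t	upper;			/* start of tuple data */
} DemoPageHeader;

/* Zero the hole between lower and upper so stale bytes there are ignored. */
static void
demo_mask_unused_space(unsigned char *page)
{
	DemoPageHeader hdr;

	memcpy(&hdr, page, sizeof(hdr));
	if (hdr.lower >= sizeof(hdr) && hdr.lower <= hdr.upper &&
		hdr.upper <= DEMO_BLCKSZ)
		memset(page + hdr.lower, 0, (size_t) (hdr.upper - hdr.lower));
}

/* Return the offset of the first differing byte, or DEMO_BLCKSZ if none. */
static size_t
demo_compare_pages(const unsigned char *a, const unsigned char *b)
{
	size_t		i;

	for (i = 0; i < DEMO_BLCKSZ; i++)
	{
		if (a[i] != b[i])
			return i;
	}
	return DEMO_BLCKSZ;
}

/* Build a zeroed page with the given header and a fill byte in the hole. */
static void
demo_init_page(unsigned char *page, uint16_t lower, uint16_t upper,
			   unsigned char hole_fill)
{
	DemoPageHeader hdr = {lower, upper};

	memset(page, 0, DEMO_BLCKSZ);
	memcpy(page, &hdr, sizeof(hdr));
	memset(page + lower, hole_fill, (size_t) (upper - lower));
}
```

Two pages that differ only inside the hole compare as different before
masking and as identical afterwards; a difference outside the hole is still
reported at its byte offset, which is the value the WARNING message prints.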

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

On Thu, Sep 1, 2016 at 11:34 PM, Peter Geoghegan <p...@heroku.com> wrote:
> On Thu, Sep 1, 2016 at 9:23 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> Indeed, it had occurred to me that we might not even want to compile
>> this code into the server unless WAL_DEBUG is defined; after all, how
>> does it help a regular user to detect that the server has a bug? Bug
>> or no bug, that's the code they've got. But on further reflection, it
>> seems like it could be useful: if we suspect a bug in the redo code
>> but we can't reproduce it here, we could ask the customer to turn this
>> option on to see whether it produces logging indicating the nature of
>> the problem. However, because of the likely expensive of enabling the
>> feature, it seems like it would be quite desirable to limit the
>> expense of generating many extra FPWs to the affected rmgr. For
>> example, if a user has a table with a btree index and a gin index, and
>> we suspect a bug in GIN, it would be nice for the user to be able to
>> enable the feature *only for GIN* rather than paying the cost of
>> enabling it for btree and heap as well.[2]
>
> Yes, that would be rather a large advantage.
>
> I think that there really is no hard distinction between users and
> hackers. Some people will want to run this in production, and it would
> be a lot better if performance was at least not atrocious. If amcheck
> couldn't do the majority of its verification with only an
> AccessShareLock, then users probably just couldn't use it. Heroku
> wouldn't have been able to use it on all production databases. It
> wouldn't have mattered that the verification was no less effective,
> since the bugs it found would simply never have been observed in
> practice.
>
> --
> Peter Geoghegan
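
As an illustration of enabling the check for selected resource managers
only, the following sketch turns a setting such as 'Gin, Btree' into a
per-rmgr boolean array. Everything here is an assumption made for the
example: the comma/space separator, the case-insensitive matching, the
demo_* names and the array indexes (which are not real rmgr IDs). The patch
may parse the GUC differently.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Names accepted by the setting, per the list in the mail above. */
static const char *const demo_rmgr_names[] = {
	"XLOG", "Heap2", "Heap", "Btree", "Hash", "Gin", "Gist",
	"Sequence", "SPGist", "BRIN", "Generic"
};

#define DEMO_NUM_RMGRS (sizeof(demo_rmgr_names) / sizeof(demo_rmgr_names[0]))

/* Case-insensitive comparison of a length-delimited token with a name. */
static bool
demo_name_equals(const char *tok, size_t len, const char *name)
{
	size_t		i;

	if (strlen(name) != len)
		return false;
	for (i = 0; i < len; i++)
	{
		if (tolower((unsigned char) tok[i]) !=
			tolower((unsigned char) name[i]))
			return false;
	}
	return true;
}

/*
 * Parse a list such as "Gin, Btree", "All" or "None" into enabled[].
 * Returns false and clears enabled[] if an unknown name is found.
 */
static bool
demo_parse_wal_consistency(const char *value, bool *enabled)
{
	const char *p = value;
	size_t		i;

	memset(enabled, 0, DEMO_NUM_RMGRS * sizeof(bool));

	for (;;)
	{
		size_t		len;
		bool		found = false;

		p += strspn(p, ", \t");		/* skip separators */
		len = strcspn(p, ", \t");	/* length of the next token */
		if (len == 0)
			break;

		if (demo_name_equals(p, len, "all"))
		{
			for (i = 0; i < DEMO_NUM_RMGRS; i++)
				enabled[i] = true;
			found = true;
		}
		else if (demo_name_equals(p, len, "none"))
			found = true;
		else
		{
			for (i = 0; i < DEMO_NUM_RMGRS; i++)
			{
				if (demo_name_equals(p, len, demo_rmgr_names[i]))
				{
					enabled[i] = true;
					found = true;
					break;
				}
			}
		}

		if (!found)
		{
			memset(enabled, 0, DEMO_NUM_RMGRS * sizeof(bool));
			return false;
		}
		p += len;
	}
	return true;
}
```

With an array like this, the redo loop and the record assembly code only
need a single lookup by rmgr to decide whether to attach and verify an
image, so a suspected GIN bug does not force extra full-page images for
heap or btree records.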
diff --git a/src/backend/access/brin/brin_xlog.c b/src/backend/access/brin/brin_xlog.c
index 27ba0a9..4c63ded 100644
--- a/src/backend/access/brin/brin_xlog.c
+++ b/src/backend/access/brin/brin_xlog.c
@@ -14,7 +14,7 @@
 #include "access/brin_pageops.h"
 #include "access/brin_xlog.h"
 #include "access/xlogutils.h"
-
+#include "storage/bufmask.h"
 
 /*
  * xlog replay routines
@@ -286,3 +286,84 @@ brin_redo(XLogReaderState *record)
 			elog(PANIC, "brin_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * It checks whether the current buffer page and backup page stored in the
+ * WAL record are consistent or not. Before comparing the two pages, it applies
+ * appropriate masking to the pages to ignore certain areas like hint bits,
+ * unused space between pd_lower and pd_upper etc. For more information about
+ * masking, see the masking function.
+ * This function should be called once WAL replay has been completed.
+ */
+void
+brin_checkConsistency(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int			block_id;
+	RelFileNode rnode;
+	ForkNumber	forknum;
+	BlockNumber blkno;
+	int			inconsistent_loc;
+	bool		has_image;
+	Page		new_page, old_page;
+
+	old_page = (Page) palloc(BLCKSZ);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer		buf;
+		char	   *norm_new_page, *norm_old_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Don't do anything. */
+			continue;
+		}
+		/*
+		 * Read the contents from the current buffer
+		 * and store it in a temporary page.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+									 RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. Before restoring, set
+		 * has_image value as true, since RestoreBlockImage checks
+		 * this flag. After restoring the image, restore the value of
+		 * has_image flag.
+		 */
+		has_image = record->blocks[block_id].has_image;
+		record->blocks[block_id].has_image = true;
+		if (!RestoreBlockImage(record, block_id, old_page))
+			elog(ERROR, "failed to restore block image");
+		record->blocks[block_id].has_image = has_image;
+
+		/* Mask pages */
+		norm_new_page = mask_brin_page(info, blkno, new_page);
+		norm_old_page = mask_brin_page(info, blkno, old_page);
+
+		/* Time to compare the old and new contents */
+		inconsistent_loc = comparePages(norm_new_page, norm_old_page);
+
+		if (inconsistent_loc < BLCKSZ)
+			elog(WARNING,
+				 "Inconsistent page (at byte %u) found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u", inconsistent_loc,
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		else
+			elog(DEBUG1,
+				 "Consistent page found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		pfree(norm_new_page);
+		pfree(norm_old_page);
+		ReleaseBuffer(buf);
+	}
+	pfree(old_page);
+}
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index a40f168..09760e0 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -15,6 +15,7 @@
 
 #include "access/gin_private.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -758,3 +759,84 @@ gin_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * It checks whether the current buffer page and backup page stored in the
+ * WAL record are consistent or not. Before comparing the two pages, it applies
+ * appropriate masking to the pages to ignore certain areas like hint bits,
+ * unused space between pd_lower and pd_upper etc. For more information about
+ * masking, see the masking function.
+ * This function should be called once WAL replay has been completed.
+ */
+void
+gin_checkConsistency(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int			block_id;
+	RelFileNode rnode;
+	ForkNumber	forknum;
+	BlockNumber blkno;
+	int			inconsistent_loc;
+	bool		has_image;
+	Page		new_page, old_page;
+
+	old_page = (Page) palloc(BLCKSZ);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer		buf;
+		char	   *norm_new_page, *norm_old_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Don't do anything. */
+			continue;
+		}
+		/*
+		 * Read the contents from the current buffer
+		 * and store it in a temporary page.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+									 RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. Before restoring, set
+		 * has_image value as true, since RestoreBlockImage checks
+		 * this flag. After restoring the image, restore the value of
+		 * has_image flag.
+		 */
+		has_image = record->blocks[block_id].has_image;
+		record->blocks[block_id].has_image = true;
+		if (!RestoreBlockImage(record, block_id, old_page))
+			elog(ERROR, "failed to restore block image");
+		record->blocks[block_id].has_image = has_image;
+
+		/* Mask Pages */
+		norm_new_page = mask_gin_page(info, blkno, new_page);
+		norm_old_page = mask_gin_page(info, blkno, old_page);
+
+		/* Time to compare the old and new contents */
+		inconsistent_loc = comparePages(norm_new_page, norm_old_page);
+
+		if (inconsistent_loc < BLCKSZ)
+			elog(WARNING,
+				 "Inconsistent page (at byte %u) found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u", inconsistent_loc,
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		else
+			elog(DEBUG1,
+				 "Consistent page found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		pfree(norm_new_page);
+		pfree(norm_old_page);
+		ReleaseBuffer(buf);
+	}
+	pfree(old_page);
+}
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 01c7ef7..5ba8ea0 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -16,6 +16,7 @@
 #include "access/gist_private.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
@@ -420,3 +421,86 @@ gistXLogUpdate(Buffer buffer,
 
 	return recptr;
 }
+
+/*
+ * It checks whether the current buffer page and backup page stored in the
+ * WAL record are consistent or not. Before comparing the two pages, it applies
+ * appropriate masking to the pages to ignore certain areas like hint bits,
+ * unused space between pd_lower and pd_upper etc. For more information about
+ * masking, see the masking function.
+ * This function should be called once WAL replay has been completed.
+ */
+void
+gist_checkConsistency(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int			block_id;
+	RelFileNode rnode;
+	ForkNumber	forknum;
+	BlockNumber blkno;
+	int			inconsistent_loc;
+	bool		has_image;
+	Page		new_page, old_page;
+
+	old_page = (Page) palloc(BLCKSZ);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer		buf;
+		char	   *norm_new_page, *norm_old_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Don't do anything. */
+			continue;
+		}
+		/*
+		 * Read the contents from the current buffer
+		 * and store it in a temporary page.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+									 RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. Before restoring, set
+		 * has_image value as true, since RestoreBlockImage checks
+		 * this flag. After restoring the image, restore the value of
+		 * has_image flag.
+		 */
+		has_image = record->blocks[block_id].has_image;
+		record->blocks[block_id].has_image = true;
+		if (!RestoreBlockImage(record, block_id, old_page))
+			elog(ERROR, "failed to restore block image");
+		record->blocks[block_id].has_image = has_image;
+
+		/* Mask pages */
+		norm_new_page = mask_gist_page(info, blkno, new_page);
+		norm_old_page = mask_gist_page(info, blkno, old_page);
+
+		/* Time to compare the old and new contents */
+		inconsistent_loc = comparePages(norm_new_page, norm_old_page);
+
+		if (inconsistent_loc < BLCKSZ)
+		{
+			elog(WARNING,
+				 "Inconsistent page (at byte %u) found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u, block_id %u", inconsistent_loc,
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno, block_id);
+		}
+		else
+			elog(DEBUG1,
+				 "Consistent page found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		pfree(norm_new_page);
+		pfree(norm_old_page);
+		ReleaseBuffer(buf);
+	}
+	pfree(old_page);
+}
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..3e34c59 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -21,10 +21,12 @@
 #include "access/hash.h"
 #include "access/hash_xlog.h"
 #include "access/relscan.h"
+#include "access/xlogutils.h"
 #include "catalog/index.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "optimizer/plancat.h"
+#include "storage/bufmask.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
 
@@ -711,3 +713,84 @@ hash_redo(XLogReaderState *record)
 {
 	elog(PANIC, "hash_redo: unimplemented");
 }
+
+/*
+ * It checks whether the current buffer page and backup page stored in the
+ * WAL record are consistent or not. Before comparing the two pages, it applies
+ * appropriate masking to the pages to ignore certain areas like hint bits,
+ * unused space between pd_lower and pd_upper etc. For more information about
+ * masking, see the masking function.
+ * This function should be called once WAL replay has been completed.
+ */
+void
+hash_checkConsistency(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int			block_id;
+	RelFileNode rnode;
+	ForkNumber	forknum;
+	BlockNumber blkno;
+	int			inconsistent_loc;
+	bool		has_image;
+	Page		new_page, old_page;
+
+	old_page = (Page) palloc(BLCKSZ);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer		buf;
+		char	   *norm_new_page, *norm_old_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Don't do anything. */
+			continue;
+		}
+		/*
+		 * Read the contents from the current buffer
+		 * and store it in a temporary page.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+									 RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. Before restoring, set
+		 * has_image value as true, since RestoreBlockImage checks
+		 * this flag. After restoring the image, restore the value of
+		 * has_image flag.
+		 */
+		has_image = record->blocks[block_id].has_image;
+		record->blocks[block_id].has_image = true;
+		if (!RestoreBlockImage(record, block_id, old_page))
+			elog(ERROR, "failed to restore block image");
+		record->blocks[block_id].has_image = has_image;
+
+		/* Mask pages */
+		norm_new_page = mask_hash_page(info, blkno, new_page);
+		norm_old_page = mask_hash_page(info, blkno, old_page);
+
+		/* Time to compare the old and new contents */
+		inconsistent_loc = comparePages(norm_new_page, norm_old_page);
+
+		if (inconsistent_loc < BLCKSZ)
+			elog(WARNING,
+				 "Inconsistent page (at byte %u) found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u", inconsistent_loc,
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		else
+			elog(DEBUG1,
+				 "Consistent page found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		pfree(norm_new_page);
+		pfree(norm_old_page);
+		ReleaseBuffer(buf);
+	}
+	pfree(old_page);
+}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6a27ef4..d56324b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -58,6 +58,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
+#include "storage/bufmask.h"
 #include "storage/freespace.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
@@ -9120,3 +9121,84 @@ heap_sync(Relation rel)
 		heap_close(toastrel, AccessShareLock);
 	}
 }
+
+/*
+ * It checks whether the current buffer page and backup page stored in the
+ * WAL record are consistent or not. Before comparing the two pages, it applies
+ * appropriate masking to the pages to ignore certain areas like hint bits,
+ * unused space between pd_lower and pd_upper etc. For more information about
+ * masking, see the masking function.
+ * This function should be called once WAL replay has been completed.
+ */
+void
+heap_checkConsistency(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int			block_id;
+	RelFileNode rnode;
+	ForkNumber	forknum;
+	BlockNumber blkno;
+	int			inconsistent_loc;
+	bool		has_image;
+	Page		new_page, old_page;
+
+	old_page = (Page) palloc(BLCKSZ);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer		buf;
+		char	   *norm_new_page, *norm_old_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Don't do anything. */
+			continue;
+		}
+		/*
+		 * Read the contents from the current buffer
+		 * and store it in a temporary page.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+									 RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. Before restoring, set
+		 * has_image value as true, since RestoreBlockImage checks
+		 * this flag. After restoring the image, restore the value of
+		 * has_image flag.
+		 */
+		has_image = record->blocks[block_id].has_image;
+		record->blocks[block_id].has_image = true;
+		if (!RestoreBlockImage(record, block_id, old_page))
+			elog(ERROR, "failed to restore block image");
+		record->blocks[block_id].has_image = has_image;
+
+		/* Mask pages */
+		norm_new_page = mask_heap_page(info, blkno, new_page);
+		norm_old_page = mask_heap_page(info, blkno, old_page);
+
+		/* Time to compare the old and new contents */
+		inconsistent_loc = comparePages(norm_new_page, norm_old_page);
+
+		if (inconsistent_loc < BLCKSZ)
+			elog(WARNING,
+				 "Inconsistent page (at byte %u) found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u", inconsistent_loc,
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		else
+			elog(DEBUG1,
+				 "Consistent page found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		pfree(norm_new_page);
+		pfree(norm_old_page);
+		ReleaseBuffer(buf);
+	}
+	pfree(old_page);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c536e22..7425a47 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -19,6 +19,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/procarray.h"
 #include "miscadmin.h"
 
@@ -1028,3 +1029,88 @@ btree_redo(XLogReaderState *record)
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
 }
+
+/*
+ * It checks whether the current buffer page and backup page stored in the
+ * WAL record are consistent or not. Before comparing the two pages, it applies
+ * appropriate masking to the pages to ignore certain areas like hint bits,
+ * unused space between pd_lower and pd_upper etc. For more information about
+ * masking, see the masking function.
+ * This function should be called once WAL replay has been completed.
+ */
+void
+btree_checkConsistency(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int			block_id;
+	RelFileNode rnode;
+	ForkNumber	forknum;
+	BlockNumber blkno;
+	int			inconsistent_loc;
+	bool		has_image;
+	Page		new_page, old_page;
+
+	/* No redo for the following type */
+	if (info == XLOG_BTREE_UNLINK_PAGE)
+		return;
+
+	old_page = (Page) palloc(BLCKSZ);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer		buf;
+		char	   *norm_new_page, *norm_old_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Don't do anything. */
+			continue;
+		}
+		/*
+		 * Read the contents from the current buffer
+		 * and store it in a temporary page.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+									 RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. Before restoring, set
+		 * has_image value as true, since RestoreBlockImage checks
+		 * this flag. After restoring the image, restore the value of
+		 * has_image flag.
+		 */
+		has_image = record->blocks[block_id].has_image;
+		record->blocks[block_id].has_image = true;
+		if (!RestoreBlockImage(record, block_id, old_page))
+			elog(ERROR, "failed to restore block image");
+		record->blocks[block_id].has_image = has_image;
+
+		/* Mask pages */
+		norm_new_page = mask_btree_page(info, blkno, new_page);
+		norm_old_page = mask_btree_page(info, blkno, old_page);
+
+		/* Time to compare the old and new contents */
+		inconsistent_loc = comparePages(norm_new_page, norm_old_page);
+
+		if (inconsistent_loc < BLCKSZ)
+			elog(WARNING,
+				 "Inconsistent page (at byte %u) found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u", inconsistent_loc,
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		else
+			elog(DEBUG1,
+				 "Consistent page found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		pfree(norm_new_page);
+		pfree(norm_old_page);
+		ReleaseBuffer(buf);
+	}
+	pfree(old_page);
+}
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index e016cdb..e972695 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -18,6 +18,7 @@
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogutils.h"
+#include "storage/bufmask.h"
 #include "storage/standby.h"
 #include "utils/memutils.h"
 
@@ -1023,3 +1024,84 @@ spg_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	opCtx = NULL;
 }
+
+/*
+ * It checks whether the current buffer page and backup page stored in the
+ * WAL record are consistent or not. Before comparing the two pages, it applies
+ * appropriate masking to the pages to ignore certain areas like hint bits,
+ * unused space between pd_lower and pd_upper etc. For more information about
+ * masking, see the masking function.
+ * This function should be called once WAL replay has been completed.
+ */
+void
+spg_checkConsistency(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int			block_id;
+	RelFileNode rnode;
+	ForkNumber	forknum;
+	BlockNumber blkno;
+	int			inconsistent_loc;
+	bool		has_image;
+	Page		new_page, old_page;
+
+	old_page = (Page) palloc(BLCKSZ);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer		buf;
+		char	   *norm_new_page, *norm_old_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Don't do anything. */
+			continue;
+		}
+		/*
+		 * Read the contents from the current buffer
+		 * and store it in a temporary page.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+									 RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. Before restoring, set
+		 * has_image value as true, since RestoreBlockImage checks
+		 * this flag. After restoring the image, restore the value of
+		 * has_image flag.
+		 */
+		has_image = record->blocks[block_id].has_image;
+		record->blocks[block_id].has_image = true;
+		if (!RestoreBlockImage(record, block_id, old_page))
+			elog(ERROR, "failed to restore block image");
+		record->blocks[block_id].has_image = has_image;
+
+		/* Mask pages */
+		norm_new_page = mask_spg_page(info, blkno, new_page);
+		norm_old_page = mask_spg_page(info, blkno, old_page);
+
+		/* Time to compare the old and new contents */
+		inconsistent_loc = comparePages(norm_new_page, norm_old_page);
+
+		if (inconsistent_loc < BLCKSZ)
+			elog(WARNING,
+				 "Inconsistent page (at byte %u) found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u", inconsistent_loc,
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		else
+			elog(DEBUG1,
+				 "Consistent page found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		pfree(norm_new_page);
+		pfree(norm_old_page);
+		ReleaseBuffer(buf);
+	}
+	pfree(old_page);
+}
diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 1926d98..ec55181 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -16,6 +16,7 @@
 #include "access/generic_xlog.h"
 #include "access/xlogutils.h"
 #include "miscadmin.h"
+#include "storage/bufmask.h"
 #include "utils/memutils.h"
 
 /*-------------------------------------------------------------------------
@@ -533,3 +534,88 @@ generic_redo(XLogReaderState *record)
 			UnlockReleaseBuffer(buffers[block_id]);
 	}
 }
+
+/*
+ * It checks whether the current buffer page and backup page stored in the
+ * WAL record are consistent or not. Before comparing the two pages, it applies
+ * appropriate masking to the pages to ignore certain areas like hint bits,
+ * unused space between pd_lower and pd_upper etc. For more information about
+ * masking, see the masking function.
+ * This function should be called once WAL replay has been completed.
+ */
+void
+generic_checkConsistency(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int			block_id;
+	RelFileNode rnode;
+	ForkNumber	forknum;
+	BlockNumber blkno;
+	int			inconsistent_loc;
+	bool		has_image;
+	Page		new_page, old_page;
+
+	old_page = (Page) palloc(BLCKSZ);
+
+	for (block_id = 0; block_id <= record->max_block_id; block_id++)
+	{
+		Buffer		buf;
+		char	   *norm_new_page, *norm_old_page;
+
+		if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+		{
+			/* Caller specified a bogus block_id. Don't do anything. */
+			continue;
+		}
+		/*
+		 * Read the contents from the current buffer
+		 * and store it in a temporary page.
+		 */
+		buf = XLogReadBufferExtended(rnode, forknum, blkno,
+									 RBM_NORMAL);
+		if (!BufferIsValid(buf))
+			continue;
+		new_page = BufferGetPage(buf);
+
+		/*
+		 * Read the contents from the backup copy, stored in WAL record
+		 * and store it in a temporary page. Before restoring, set
+		 * has_image value as true, since RestoreBlockImage checks
+		 * this flag. After restoring the image, restore the value of
+		 * has_image flag.
+		 */
+		has_image = record->blocks[block_id].has_image;
+		record->blocks[block_id].has_image = true;
+		if (!RestoreBlockImage(record, block_id, old_page))
+			elog(ERROR, "failed to restore block image");
+		record->blocks[block_id].has_image = has_image;
+
+		/*
+		 * At present, generic xlog is used only by bloom index.
+		 * We are masking it as common page. It can be changed
+		 * if required.
+		 */
+		norm_new_page = mask_common_page(info, blkno, new_page, true, true);
+		norm_old_page = mask_common_page(info, blkno, old_page, true, true);
+
+		/* Time to compare the old and new contents */
+		inconsistent_loc = comparePages(norm_new_page, norm_old_page);
+
+		if (inconsistent_loc < BLCKSZ)
+			elog(WARNING,
+				 "Inconsistent page (at byte %u) found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u", inconsistent_loc,
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		else
+			elog(DEBUG1,
+				 "Consistent page found, rel %u/%u/%u, "
+				 "forknum %u, blkno %u",
+				 rnode.spcNode, rnode.dbNode, rnode.relNode,
+				 forknum, blkno);
+		pfree(norm_new_page);
+		pfree(norm_old_page);
+		ReleaseBuffer(buf);
+	}
+	pfree(old_page);
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9bb1362..7e85c2b 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -26,12 +26,13 @@
 #include "commands/tablespace.h"
 #include "replication/message.h"
 #include "replication/origin.h"
+#include "storage/bufmask.h"
 #include "storage/standby.h"
 #include "utils/relmapper.h"
 
 /* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
-	{ name, redo, desc, identify, startup, cleanup },
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,checkConsistency) \
+	{ name, redo, desc, identify, startup, cleanup, checkConsistency },
 
 const RmgrData RmgrTable[RM_MAX_ID + 1] = {
 #include "access/rmgrlist.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2189c22..5ad6228 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -25,6 +25,7 @@
 #include "access/commit_ts.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
+#include "access/rmgr.h"
 #include "access/subtrans.h"
 #include "access/timeline.h"
 #include "access/transam.h"
@@ -53,6 +54,8 @@
 #include "replication/walsender.h"
 #include "storage/barrier.h"
 #include "storage/bufmgr.h"
+#include "storage/bufmask.h"
+#include "storage/bufpage.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/large_object.h"
@@ -95,6 +98,8 @@ bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
 bool		wal_log_hints = false;
 bool		wal_compression = false;
+char	   *wal_consistency_string = NULL;
+bool	   *wal_consistency = NULL;
 bool		log_checkpoints = false;
 int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
@@ -6944,6 +6949,14 @@ StartupXLOG(void)
 				/* Now apply the WAL record itself */
 				RmgrTable[record->xl_rmid].rm_redo(xlogreader);
 
+				/*
+				 * After redo, check whether the backup pages associated with the WAL record
+				 * are consistent with the existing pages. This check is done only
+				 * if consistency check is enabled for the corresponding rmid.
+				 */
+				if(wal_consistency[record->xl_rmid])
+					RmgrTable[record->xl_rmid].rm_checkConsistency(xlogreader);
+
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 
@@ -11708,3 +11721,87 @@ XLogRequestWalReceiverReply(void)
 {
 	doRequestWalReceiverReply = true;
 }
+
+/*
+ * It checks whether the current buffer page and backup page stored in the
+ * WAL record are consistent or not. Before comparing the two pages, it applies
+ * appropriate masking to the pages to ignore certain areas like hint bits,
+ * unused space between pd_lower and pd_upper etc. For more information about
+ * masking, see the masking function.
+ * This function should be called once WAL replay has been completed.
+ */
+void
+xlog_checkConsistency(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int			block_id;
+	RelFileNode rnode;
+	ForkNumber	forknum;
+	BlockNumber blkno;
+	int			inconsistent_loc;
+	bool		has_image;
+	Page		new_page, old_page;
+
+	/* in XLOG rmgr, backup blocks are only used by XLOG_FPI records */
+	if (info == XLOG_FPI || info == XLOG_FPI_FOR_HINT)
+	{
+		old_page = (Page) palloc(BLCKSZ);
+		for (block_id = 0; block_id <= record->max_block_id; block_id++)
+		{
+			Buffer		buf;
+			char	   *norm_new_page, *norm_old_page;
+
+			if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+			{
+				/* Caller specified a bogus block_id. Don't do anything. */
+				continue;
+			}
+			/*
+			 * Read the contents from the current buffer
+			 * and store it in a temporary page.
+			 */
+			buf = XLogReadBufferExtended(rnode, forknum, blkno,
+										 RBM_NORMAL);
+			if (!BufferIsValid(buf))
+				continue;
+			new_page = BufferGetPage(buf);
+
+			/*
+			 * Read the contents from the backup copy, stored in WAL record
+			 * and store it in a temporary page. Before restoring, set
+			 * has_image value as true, since RestoreBlockImage checks
+			 * this flag. After restoring the image, restore the value of
+			 * has_image flag.
+ */ + has_image = record->blocks[block_id].has_image; + record->blocks[block_id].has_image = true; + if (!RestoreBlockImage(record, block_id, old_page)) + elog(ERROR, "failed to restore block image"); + record->blocks[block_id].has_image = has_image; + + /* Mask pages */ + norm_new_page = mask_common_page(info, blkno, new_page, false, false); + norm_old_page = mask_common_page(info, blkno, old_page, false, false); + + /* Time to compare the old and new contents */ + inconsistent_loc = comparePages(norm_new_page, norm_old_page); + + if (inconsistent_loc < BLCKSZ) + elog(WARNING, + "Inconsistent page (at byte %u) found, rel %u/%u/%u, " + "forknum %u, blkno %u", inconsistent_loc, + rnode.spcNode, rnode.dbNode, rnode.relNode, + forknum, blkno); + else + elog(DEBUG1, + "Consistent page found, rel %u/%u/%u, " + "forknum %u, blkno %u", + rnode.spcNode, rnode.dbNode, rnode.relNode, + forknum, blkno); + pfree(norm_new_page); + pfree(norm_old_page); + ReleaseBuffer(buf); + } + pfree(old_page); + } +} diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c index 3cd273b..2f7c36b 100644 --- a/src/backend/access/transam/xloginsert.c +++ b/src/backend/access/transam/xloginsert.c @@ -556,7 +556,11 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT) bkpb.fork_flags |= BKPBLOCK_WILL_INIT; - if (needs_backup) + /* + * If wal consistency check is enabled for current rmid, + * we do fpw for the current block. + */ + if (needs_backup || wal_consistency[rmid]) { Page page = regbuf->page; uint16 compressed_len; @@ -608,7 +612,16 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, * Fill in the remaining fields in the XLogRecordBlockHeader * struct */ - bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE; + + /* + * Remember that, if WAL consistency check is enabled for the current rmid, + * we always include backup image with the WAL record. If needs_backup is enabled, + * only then set BKPBLOCK_HAS_IMAGE flag. 
During redo, this flag is used + * to set has_image flag in DecodedBkpBlock. We don't want to set + * this flag unnecessarily, since this will restore the page during redo. + */ + if (needs_backup) + bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE; /* * Construct XLogRecData entries for the page content. @@ -680,7 +693,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, /* Ok, copy the header to the scratch buffer */ memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader); scratch += SizeOfXLogRecordBlockHeader; - if (needs_backup) + if (needs_backup || wal_consistency[rmid]) { memcpy(scratch, &bimg, SizeOfXLogRecordBlockImageHeader); scratch += SizeOfXLogRecordBlockImageHeader; diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index f2da505..2f6b51e 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -1026,6 +1026,12 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg) uint32 datatotal; RelFileNode *rnode = NULL; uint8 block_id; + bool checkConsistency = false; + + #ifndef FRONTEND + /* Check whether wal consistency check is enabled for the current rmid.*/ + checkConsistency = wal_consistency[record->xl_rmid]; + #endif ResetDecoder(state); @@ -1114,7 +1120,11 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg) } datatotal += blk->data_len; - if (blk->has_image) + /* + * If wal consistency check is enabled, then it will always + * have a backup image. + */ + if (blk->has_image || checkConsistency) { COPY_HEADER_FIELD(&blk->bimg_len, sizeof(uint16)); COPY_HEADER_FIELD(&blk->hole_offset, sizeof(uint16)); @@ -1242,7 +1252,11 @@ DecodeXLogRecord(XLogReaderState *state, XLogRecord *record, char **errormsg) if (!blk->in_use) continue; - if (blk->has_image) + /* + * If wal consistency check is enabled, then it will always + * have a backup image. 
+ */ + if (blk->has_image || checkConsistency) { blk->bkp_image = ptr; ptr += blk->bimg_len; diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c index c98f981..a7349b2 100644 --- a/src/backend/commands/sequence.c +++ b/src/backend/commands/sequence.c @@ -31,6 +31,7 @@ #include "funcapi.h" #include "miscadmin.h" #include "nodes/makefuncs.h" +#include "storage/bufmask.h" #include "storage/lmgr.h" #include "storage/proc.h" #include "storage/smgr.h" @@ -49,16 +50,6 @@ #define SEQ_LOG_VALS 32 /* - * The "special area" of a sequence's buffer page looks like this. - */ -#define SEQ_MAGIC 0x1717 - -typedef struct sequence_magic -{ - uint32 magic; -} sequence_magic; - -/* * We store a SeqTable item for every sequence we have touched in the current * session. This is needed to hold onto nextval/currval state. (We can't * rely on the relcache, since it's only, well, a cache, and may decide to @@ -329,7 +320,7 @@ fill_seq_with_data(Relation rel, HeapTuple tuple) { Buffer buf; Page page; - sequence_magic *sm; + SequencePageOpaqueData *sm; OffsetNumber offnum; /* Initialize first page of relation with special magic number */ @@ -339,9 +330,9 @@ fill_seq_with_data(Relation rel, HeapTuple tuple) page = BufferGetPage(buf); - PageInit(page, BufferGetPageSize(buf), sizeof(sequence_magic)); - sm = (sequence_magic *) PageGetSpecialPointer(page); - sm->magic = SEQ_MAGIC; + PageInit(page, BufferGetPageSize(buf), sizeof(SequencePageOpaqueData)); + sm = (SequencePageOpaqueData *) PageGetSpecialPointer(page); + sm->seq_page_id = SEQ_MAGIC; /* Now insert sequence tuple */ @@ -1109,18 +1100,18 @@ read_seq_tuple(SeqTable elm, Relation rel, Buffer *buf, HeapTuple seqtuple) { Page page; ItemId lp; - sequence_magic *sm; + SequencePageOpaqueData *sm; Form_pg_sequence seq; *buf = ReadBuffer(rel, 0); LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE); page = BufferGetPage(*buf); - sm = (sequence_magic *) PageGetSpecialPointer(page); + sm = (SequencePageOpaqueData *) 
PageGetSpecialPointer(page); - if (sm->magic != SEQ_MAGIC) + if (sm->seq_page_id != SEQ_MAGIC) elog(ERROR, "bad magic number in sequence \"%s\": %08X", - RelationGetRelationName(rel), sm->magic); + RelationGetRelationName(rel), sm->seq_page_id); lp = PageGetItemId(page, FirstOffsetNumber); Assert(ItemIdIsNormal(lp)); @@ -1585,7 +1576,7 @@ seq_redo(XLogReaderState *record) char *item; Size itemsz; xl_seq_rec *xlrec = (xl_seq_rec *) XLogRecGetData(record); - sequence_magic *sm; + SequencePageOpaqueData *sm; if (info != XLOG_SEQ_LOG) elog(PANIC, "seq_redo: unknown op code %u", info); @@ -1604,9 +1595,9 @@ seq_redo(XLogReaderState *record) */ localpage = (Page) palloc(BufferGetPageSize(buffer)); - PageInit(localpage, BufferGetPageSize(buffer), sizeof(sequence_magic)); - sm = (sequence_magic *) PageGetSpecialPointer(localpage); - sm->magic = SEQ_MAGIC; + PageInit(localpage, BufferGetPageSize(buffer), sizeof(SequencePageOpaqueData)); + sm = (SequencePageOpaqueData *) PageGetSpecialPointer(localpage); + sm->seq_page_id = SEQ_MAGIC; item = (char *) xlrec + sizeof(xl_seq_rec); itemsz = XLogRecGetDataLen(record) - sizeof(xl_seq_rec); @@ -1638,3 +1629,87 @@ ResetSequenceCaches(void) last_used_seq = NULL; } + +/* + * It checks whether the current buffer page and backup page stored in the + * WAL record are consistent or not. Before comparing the two pages, it applies + * appropriate masking to the pages to ignore certain areas like hint bits, + * unused space between pd_lower and pd_upper, etc. For more information about + * masking, see the masking function. + * This function should be called once WAL replay has been completed.
+ */ +void +seq_checkConsistency(XLogReaderState *record) +{ + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + int block_id; + RelFileNode rnode; + ForkNumber forknum; + BlockNumber blkno; + int inconsistent_loc; + bool has_image; + Page new_page, old_page; + + old_page = (Page) palloc(BLCKSZ); + + for (block_id = 0; block_id <= record->max_block_id; block_id++) + { + Buffer buf; + char *norm_new_page, *norm_old_page; + + if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno)) + { + /* Caller specified a bogus block_id. Don't do anything. */ + continue; + } + /* + * Read the contents from the current buffer + * and store it in a temporary page. + */ + buf = XLogReadBufferExtended(rnode, forknum, blkno, + RBM_NORMAL); + if (!BufferIsValid(buf)) + continue; + new_page = BufferGetPage(buf); + + /* + * Read the contents from the backup copy, stored in WAL record + * and store it in a temporary page. Before restoring, set + * has_image value as true, since RestoreBlockImage checks + * this flag. After restoring the image, restore the value of + * has_image flag. + */ + has_image = record->blocks[block_id].has_image; + record->blocks[block_id].has_image = true; + if (!RestoreBlockImage(record, block_id, old_page)) + elog(ERROR, "failed to restore block image"); + record->blocks[block_id].has_image = has_image; + + /* Since, we always reinit the page in seq_redo, there is no need + * to handle any special cases during masking. We can use common + * mask function to mask seq pages. 
+ */ + norm_new_page = mask_common_page(info, blkno, new_page, true, true); + norm_old_page = mask_common_page(info, blkno, old_page, true, true); + + /* Time to compare the old and new contents */ + inconsistent_loc = comparePages(norm_new_page, norm_old_page); + + if (inconsistent_loc < BLCKSZ) + elog(WARNING, + "Inconsistent page (at byte %u) found, rel %u/%u/%u, " + "forknum %u, blkno %u", inconsistent_loc, + rnode.spcNode, rnode.dbNode, rnode.relNode, + forknum, blkno); + else + elog(DEBUG1, + "Consistent page found, rel %u/%u/%u, " + "forknum %u, blkno %u", + rnode.spcNode, rnode.dbNode, rnode.relNode, + forknum, blkno); + pfree(norm_new_page); + pfree(norm_old_page); + ReleaseBuffer(buf); + } + pfree(old_page); +} diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile index 2c10fba..8630dca 100644 --- a/src/backend/storage/buffer/Makefile +++ b/src/backend/storage/buffer/Makefile @@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer top_builddir = ../../../.. include $(top_builddir)/src/Makefile.global -OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o +OBJS = buf_table.o buf_init.o bufmask.o bufmgr.o freelist.o localbuf.o include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/storage/buffer/bufmask.c b/src/backend/storage/buffer/bufmask.c new file mode 100644 index 0000000..6b86379 --- /dev/null +++ b/src/backend/storage/buffer/bufmask.c @@ -0,0 +1,468 @@ +/*------------------------------------------------------------------------- + * + * bufmask.c + * Routines for buffer masking, used to ensure that buffers used for + * comparison across nodes are in a consistent state. + * + * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * Most pages cannot be compared directly, because some parts of the + * page are not expected to be byte-by-byte identical. For example, + * hint bits or unused space in the page. 
The strategy is to normalize + * all pages by creating a mask of those bits that are not expected to + * match. + * + * IDENTIFICATION + * src/backend/storage/buffer/bufmask.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "access/brin_page.h" +#include "access/nbtree.h" +#include "access/gist.h" +#include "access/gin_private.h" +#include "access/hash.h" +#include "access/htup_details.h" +#include "access/spgist_private.h" +#include "commands/sequence.h" +#include "storage/bufmask.h" +#include "storage/bufmgr.h" + +/* Marker used to mask pages consistently */ +#define MASK_MARKER 0xFF + +static void mask_page_lsn(Page page); +static void mask_page_hint_bits(Page page); +static void mask_unused_space(Page page); + +/* + * Mask Page LSN + */ +static void +mask_page_lsn(Page page) +{ + PageHeader phdr = (PageHeader) page; + PageXLogRecPtrSet(phdr->pd_lsn, 0xFFFFFFFFFFFFFFFF); +} + +/* + * Mask Page hint bits + */ +static void +mask_page_hint_bits(Page page) +{ + PageHeader phdr = (PageHeader) page; + + /* Ignore prune_xid (it's like a hint-bit) */ + phdr->pd_prune_xid = 0xFFFFFFFF; + + /* Ignore PD_PAGE_FULL and PD_HAS_FREE_LINES flags, they are just hints */ + phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES; + + /* + * Also mask the all-visible flag. + * + * XXX: It is unfortunate that we have to do this. If the flag is set + * incorrectly, that's serious, and we would like to catch it. If the flag + * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN + * records don't currently set the flag, even though it is set in the + * master, so we must silence failures that that causes. + */ + phdr->pd_flags |= PD_ALL_VISIBLE; +} +/* + * Mask the unused space of a page between pd_lower and pd_upper. 
+ */ +static void +mask_unused_space(Page page) +{ + int pd_lower = ((PageHeader) page)->pd_lower; + int pd_upper = ((PageHeader) page)->pd_upper; + int pd_special = ((PageHeader) page)->pd_special; + + /* Sanity check */ + if (pd_lower > pd_upper || pd_special < pd_upper || + pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ) + { + elog(ERROR, "invalid page pd_lower %d pd_upper %d pd_special %d", + pd_lower, pd_upper, pd_special); + } + + memset(page + pd_lower, MASK_MARKER, pd_upper - pd_lower); +} + +/* + * Mask a heap page + */ +char * +mask_heap_page(uint8 info, BlockNumber blkno, const char *page) +{ + Page page_norm; + OffsetNumber off; + + page_norm = (Page) palloc(BLCKSZ); + memcpy(page_norm, page, BLCKSZ); + + /* + * Mask the Page LSN. Because, we store the page before updating the LSN. + * Hence, LSNs of both pages will always be different. + */ + mask_page_lsn(page_norm); + + mask_page_hint_bits(page_norm); + mask_unused_space(page_norm); + + for (off = 1; off <= PageGetMaxOffsetNumber(page_norm); off++) + { + ItemId iid = PageGetItemId(page_norm, off); + char *page_item; + + page_item = (char *) (page_norm + ItemIdGetOffset(iid)); + + /* + * Ignore hint bits and command ID. + */ + if (ItemIdIsNormal(iid)) + { + HeapTupleHeader page_htup = (HeapTupleHeader) page_item; + + page_htup->t_infomask = + HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID | + HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID; + page_htup->t_infomask |= HEAP_XACT_MASK; + page_htup->t_choice.t_heap.t_field3.t_cid = 0xFFFFFFFF; + + /* + * For a speculative tuple, the content of t_ctid is conflicting + * between the backup page and current page. Hence, I set it + * to current block number and offset. Need suggestions! + */ + if (HeapTupleHeaderIsSpeculative(page_htup)) + { + ItemPointerSet(&page_htup->t_ctid, blkno, off); + } + } + + /* + * Ignore any padding bytes after the tuple, when the length of + * the item is not MAXALIGNed.
+ */ + if (ItemIdHasStorage(iid)) + { + int len = ItemIdGetLength(iid); + int padlen = MAXALIGN(len) - len; + + if (padlen > 0) + memset(page_item + len, MASK_MARKER, padlen); + } + } + return (char *)page_norm; +} + +/* + * Mask a btree page + */ +char * +mask_btree_page(uint8 info, BlockNumber blkno, const char *page) +{ + Page page_norm; + OffsetNumber off; + OffsetNumber maxoff; + BTPageOpaque maskopaq; + + page_norm = (Page) palloc(BLCKSZ); + memcpy(page_norm, page, BLCKSZ); + + /* + * Mask the Page LSN. Because, we store the page before updating the LSN. + * Hence, LSNs of both pages will always be different. + */ + mask_page_lsn(page_norm); + + mask_page_hint_bits(page_norm); + mask_unused_space(page_norm); + + maskopaq = (BTPageOpaque) + (((char *) page_norm) + ((PageHeader) page_norm)->pd_special); + /* + * Mask everything on a DELETED page. + */ + if (((BTPageOpaque) PageGetSpecialPointer(page_norm))->btpo_flags & BTP_DELETED) + { + /* Page content, between standard page header and opaque struct */ + memset(page_norm + SizeOfPageHeaderData, MASK_MARKER, + BLCKSZ - MAXALIGN(sizeof(BTPageOpaqueData)) - SizeOfPageHeaderData); + + /* pd_lower and upper */ + memset(&((PageHeader) page_norm)->pd_lower, MASK_MARKER, sizeof(uint16)); + memset(&((PageHeader) page_norm)->pd_upper, MASK_MARKER, sizeof(uint16)); + } + else + { + /* + * Mask some line pointer bits, particularly those marked as + * used on a master and unused on a standby. + * XXX: This could be refined. 
+ */ + maxoff = PageGetMaxOffsetNumber(page_norm); + for (off = 1; off <= maxoff; off++) + { + ItemId iid = PageGetItemId(page_norm, off); + + if (ItemIdIsUsed(iid)) + iid->lp_flags = LP_UNUSED; + } + } + + maskopaq->btpo_flags |= BTP_SPLIT_END | BTP_HAS_GARBAGE; + maskopaq->btpo_cycleid = 0; + + return (char *)page_norm; +} + +/* + * Mask a hash page + */ +char * +mask_hash_page(uint8 info, BlockNumber blkno, const char *page) +{ + Page page_norm; + OffsetNumber off; + OffsetNumber maxoff; + HashPageOpaque opaque; + + page_norm = (Page) palloc(BLCKSZ); + memcpy(page_norm, page, BLCKSZ); + + /* + * Mask the Page LSN. Because, we store the page before updating the LSN. + * Hence, LSNs of both pages will always be different. + */ + mask_page_lsn(page_norm); + + mask_page_hint_bits(page_norm); + mask_unused_space(page_norm); + + opaque = (HashPageOpaque) PageGetSpecialPointer(page_norm); + /* + * Mask everything on a UNUSED page. + */ + if (opaque->hasho_flag & LH_UNUSED_PAGE) + { + /* Page content, between standard page header and opaque struct */ + memset(page_norm + SizeOfPageHeaderData, MASK_MARKER, + BLCKSZ - MAXALIGN(sizeof(HashPageOpaqueData)) - SizeOfPageHeaderData); + + /* pd_lower and upper */ + memset(&((PageHeader) page_norm)->pd_lower, MASK_MARKER, sizeof(uint16)); + memset(&((PageHeader) page_norm)->pd_upper, MASK_MARKER, sizeof(uint16)); + } + else if ((opaque->hasho_flag & LH_META_PAGE)==0) + { + /* + * For pages other than metapage, + * Mask some line pointer bits, particularly those marked as + * used on a master and unused on a standby. + * XXX: This could be refined. 
+ */ + maxoff = PageGetMaxOffsetNumber(page_norm); + for (off = 1; off <= maxoff; off++) + { + ItemId iid = PageGetItemId(page_norm, off); + + if (ItemIdIsUsed(iid)) + iid->lp_flags = LP_UNUSED; + } + } + return (char *)page_norm; +} + +/* + * Mask a SpGist page + */ +char * +mask_spg_page(uint8 info, BlockNumber blkno, const char *page) +{ + Page page_norm; + + page_norm = (Page) palloc(BLCKSZ); + memcpy(page_norm, page, BLCKSZ); + + /* + * Mask the Page LSN. Because, we store the page before updating the LSN. + * Hence, LSNs of both pages will always be different. + */ + mask_page_lsn(page_norm); + + mask_page_hint_bits(page_norm); + + if (!SpGistPageIsMeta(page_norm)) + mask_unused_space(page_norm); + + return (char *)page_norm; +} + +/* + * Mask a GIST page + */ +char * +mask_gist_page(uint8 info, BlockNumber blkno, const char *page) +{ + Page page_norm; + OffsetNumber offnum, + maxoff; + + page_norm = (Page) palloc(BLCKSZ); + memcpy(page_norm, page, BLCKSZ); + + /* + * Mask the Page LSN. Because, we store the page before updating the LSN. + * Hence, LSNs of both pages will always be different. + */ + mask_page_lsn(page_norm); + + mask_page_hint_bits(page_norm); + mask_unused_space(page_norm); + + /*Mask NSN*/ + GistPageSetNSN(page_norm, 0xFFFFFFFFFFFFFFFF); + + /* + * We update F_FOLLOW_RIGHT flag on the left child after writing WAL record. + * Hence, mask this flag. + */ + GistMarkFollowRight(page_norm); + + if (GistPageIsLeaf(page_norm)) + { + /* + * For gist leaf pages, + * Mask some line pointer bits, particularly those marked as + * used on a master and unused on a standby. + * XXX: This could be refined. + */ + maxoff = PageGetMaxOffsetNumber(page_norm); + for (offnum = FirstOffsetNumber; + offnum <= maxoff; + offnum = OffsetNumberNext(offnum)) + { + ItemId itemId = PageGetItemId(page_norm, offnum); + + if (ItemIdIsUsed(itemId)) + itemId->lp_flags = LP_UNUSED; + } + } + + /* In Gist redo, we never mark a page as garbage. 
Hence, mask it. */ + GistClearPageHasGarbage(page_norm); + return (char *)page_norm; +} + +/* + * Mask a Gin page + */ +char * +mask_gin_page(uint8 info, BlockNumber blkno, const char *page) +{ + Page page_norm; + GinPageOpaque opaque; + + page_norm = (Page) palloc(BLCKSZ); + memcpy(page_norm, page, BLCKSZ); + + /* + * Mask the Page LSN. Because, we store the page before updating the LSN. + * Hence, LSNs of both pages will always be different. + */ + mask_page_lsn(page_norm); + opaque = GinPageGetOpaque(page_norm); + + /* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. */ + if (blkno != 0) + { + mask_page_hint_bits(page_norm); + + /* + * A GIN_DELETED page is initialized to empty. + * Hence, mask everything. + */ + if (opaque->flags & GIN_DELETED) + memset(page_norm, MASK_MARKER, BLCKSZ); + else + mask_unused_space(page_norm); + } + + return (char *)page_norm; +} + +/* + * Mask a BRIN page + */ +char * +mask_brin_page(uint8 info, BlockNumber blkno, const char *page) +{ + Page page_norm; + OffsetNumber offnum, + maxoff; + + page_norm = (Page) palloc(BLCKSZ); + memcpy(page_norm, page, BLCKSZ); + + /* + * Mask the Page LSN. Because, we store the page before updating the LSN. + * Hence, LSNs of both pages will always be different.
+ */ + mask_page_lsn(page_norm); + + mask_page_hint_bits(page_norm); + + if (BRIN_IS_REGULAR_PAGE(page_norm)) + { + mask_unused_space(page_norm); + + maxoff = PageGetMaxOffsetNumber(page_norm); + for (offnum = FirstOffsetNumber; + offnum <= maxoff; + offnum = OffsetNumberNext(offnum)) + { + ItemId itemId = PageGetItemId(page_norm, offnum); + + if (ItemIdIsUsed(itemId)) + itemId->lp_flags = LP_UNUSED; + } + } + + /* We need to handle brin pages of type Meta and Revmap if needed */ + + return (char *)page_norm; +} + +/* + * Mask a common page + */ +char * +mask_common_page(uint8 info, BlockNumber blkno, const char *page, bool maskHints, bool maskUnusedSpace) +{ + Page page_norm; + + page_norm = (Page) palloc(BLCKSZ); + memcpy(page_norm, page, BLCKSZ); + + /* + * Mask the Page LSN. Because, we store the page before updating the LSN. + * Hence, LSNs of both pages will always be different. + */ + mask_page_lsn(page_norm); + + if(maskHints) + mask_page_hint_bits(page_norm); + + if(maskUnusedSpace) + mask_unused_space(page_norm); + + return (char *)page_norm; +} diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c index f2a07f2..cc35fc4 100644 --- a/src/backend/storage/page/bufpage.c +++ b/src/backend/storage/page/bufpage.c @@ -1134,3 +1134,47 @@ PageSetChecksumInplace(Page page, BlockNumber blkno) ((PageHeader) page)->pd_checksum = pg_checksum_page((char *) page, blkno); } + +/* + * Compare the contents of two pages. + * If the two pages are exactly same, it returns BLCKSZ. Otherwise, + * it returns the location where the first mismatch has occurred. + */ +int +comparePages(char *page1, char *page2) +{ + char buf1[BLCKSZ * 2]; + char buf2[BLCKSZ * 2]; + int j = 0; + int i; + + /* + * Convert the pages to be compared into hex format to facilitate + * their comparison and make potential diffs more readable while + * debugging. 
+ */ + for (i = 0; i < BLCKSZ ; i++) + { + const char *digits = "0123456789ABCDEF"; + uint8 byte1 = (uint8) page1[i]; + uint8 byte2 = (uint8) page2[i]; + + buf1[j] = digits[byte1 >> 4]; + buf2[j] = digits[byte2 >> 4]; + + if (buf1[j] != buf2[j]) + { + break; + } + j++; + + buf1[j] = digits[byte1 & 0x0F]; + buf2[j] = digits[byte2 & 0x0F]; + if (buf1[j] != buf2[j]) + { + break; + } + j++; + } + return i; +} diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index c5178f7..71baf0a 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -144,6 +144,9 @@ static bool call_enum_check_hook(struct config_enum * conf, int *newval, static bool check_log_destination(char **newval, void **extra, GucSource source); static void assign_log_destination(const char *newval, void *extra); +static bool check_wal_consistency(char **newval, void **extra, GucSource source); +static void assign_wal_consistency(const char *newval, void *extra); + #ifdef HAVE_SYSLOG static int syslog_facility = LOG_LOCAL0; #else @@ -3248,6 +3251,17 @@ static struct config_string ConfigureNamesString[] = }, { + {"wal_consistency", PGC_POSTMASTER, WAL_SETTINGS, + gettext_noop("Sets the rmgrIDs for which WAL consistency should be checked."), + gettext_noop("Valid values are combinations of rmgrIDs"), + GUC_LIST_INPUT + }, + &wal_consistency_string, + "NONE", + check_wal_consistency, assign_wal_consistency, NULL + }, + + { {"log_destination", PGC_SIGHUP, LOGGING_WHERE, gettext_noop("Sets the destination for server log output."), gettext_noop("Valid values are combinations of \"stderr\", " @@ -3259,6 +3273,7 @@ static struct config_string ConfigureNamesString[] = "stderr", check_log_destination, assign_log_destination, NULL }, + { {"log_directory", PGC_SIGHUP, LOGGING_WHERE, gettext_noop("Sets the destination directory for log files."), @@ -9903,6 +9918,128 @@ assign_log_destination(const char *newval, void *extra) Log_destination = *((int *) extra); } +static bool 
+check_wal_consistency(char **newval, void **extra, GucSource source) +{ + char *rawstring; + List *elemlist; + ListCell *l; + bool *newwalconsistency; + int i; + + newwalconsistency = (bool *) guc_malloc(ERROR, (RM_MAX_ID + 1) * sizeof(bool)); + + /* Initialize the array */ + for (i = 0; i < RM_MAX_ID + 1; i++) + newwalconsistency[i] = false; + + /* Need a modifiable copy of string */ + rawstring = pstrdup(*newval); + + /* Parse string into list of identifiers */ + if (!SplitIdentifierString(rawstring, ',', &elemlist)) + { + /* syntax error in list */ + GUC_check_errdetail("List syntax is invalid."); + pfree(rawstring); + list_free(elemlist); + return false; + } + + foreach(l, elemlist) + { + char *tok = (char *) lfirst(l); + if (pg_strcasecmp(tok, "Heap2") == 0) + { + newwalconsistency[RM_HEAP2_ID] = true; + } + else if (pg_strcasecmp(tok, "Heap") == 0) + { + newwalconsistency[RM_HEAP_ID] = true; + } + else if (pg_strcasecmp(tok, "Btree") == 0) + { + newwalconsistency[RM_BTREE_ID] = true; + } + else if (pg_strcasecmp(tok, "Hash") == 0) + { + newwalconsistency[RM_HASH_ID] = true; + } + else if (pg_strcasecmp(tok, "Gin") == 0) + { + newwalconsistency[RM_GIN_ID] = true; + } + else if (pg_strcasecmp(tok, "Gist") == 0) + { + newwalconsistency[RM_GIST_ID] = true; + } + else if (pg_strcasecmp(tok, "Sequence") == 0) + { + newwalconsistency[RM_SEQ_ID] = true; + } + else if (pg_strcasecmp(tok, "SPGist") == 0) + { + newwalconsistency[RM_SPGIST_ID] = true; + } + else if (pg_strcasecmp(tok, "BRIN") == 0) + { + newwalconsistency[RM_BRIN_ID] = true; + } + else if (pg_strcasecmp(tok, "Generic") == 0) + { + newwalconsistency[RM_GENERIC_ID] = true; + } + else if (pg_strcasecmp(tok, "XLOG") == 0) + { + newwalconsistency[RM_XLOG_ID] = true; + } + else if (pg_strcasecmp(tok, "NONE") == 0) + { + for (i = 0; i < RM_MAX_ID + 1; i++) + newwalconsistency[i] = false; + break; + } + else if (pg_strcasecmp(tok, "ALL") == 0) + { + /* + * The following rmids can have backup blocks.
+ * We'll enable this feature only for these rmids. + */ + newwalconsistency[RM_HEAP2_ID] = true; + newwalconsistency[RM_HEAP_ID] = true; + newwalconsistency[RM_BTREE_ID] = true; + newwalconsistency[RM_HASH_ID] = true; + newwalconsistency[RM_GIN_ID] = true; + newwalconsistency[RM_GIST_ID] = true; + newwalconsistency[RM_SEQ_ID] = true; + newwalconsistency[RM_SPGIST_ID] = true; + newwalconsistency[RM_BRIN_ID] = true; + newwalconsistency[RM_GENERIC_ID] = true; + newwalconsistency[RM_XLOG_ID] = true; + } + else + { + GUC_check_errdetail("Unrecognized key word: \"%s\".", tok); + pfree(rawstring); + list_free(elemlist); + return false; + } + } + + pfree(rawstring); + list_free(elemlist); + + *extra = (void *) newwalconsistency; + + return true; +} + +static void +assign_wal_consistency(const char *newval, void *extra) +{ + wal_consistency = (bool *) extra; +} + static void assign_syslog_facility(int newval, void *extra) { diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 6d0666c..e1f688e 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -191,6 +191,11 @@ # open_sync #full_page_writes = on # recover from partial page writes #wal_compression = off # enable compression of full-page writes +#wal_consistency = 'none' # Valid values are combinations of + # Heap2, Heap, Btree, Hash, Gin, Gist, Sequence, + # SPGist, BRIN, Generic and XLOG. It can also + # be set to ALL to enable all the values. 
+                    # (change requires restart)
 #wal_log_hints = off            # also do full page writes of non-critical updates
                     # (change requires restart)
 #wal_buffers = -1            # min 32kB, -1 sets based on shared_buffers
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index b53591d..e9c7914 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -29,7 +29,7 @@
  * RmgrNames is an array of resource manager names, to make error messages
  * a bit nicer.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,check) \
   name,
 
 static const char *RmgrNames[RM_MAX_ID + 1] = {
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 8fe20ce..8418281 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -32,7 +32,7 @@
 #include "storage/standbydefs.h"
 #include "utils/relmapper.h"
 
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,check) \
     { name, desc, identify},
 
 const RmgrDescData RmgrDescTable[RM_MAX_ID + 1] = {
diff --git a/src/include/access/brin_xlog.h b/src/include/access/brin_xlog.h
index f614805..d99dd42 100644
--- a/src/include/access/brin_xlog.h
+++ b/src/include/access/brin_xlog.h
@@ -128,5 +128,6 @@ typedef struct xl_brin_revmap_extend
 extern void brin_redo(XLogReaderState *record);
 extern void brin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *brin_identify(uint8 info);
+extern void brin_checkConsistency(XLogReaderState *record);
 
 #endif /* BRIN_XLOG_H */
diff --git a/src/include/access/generic_xlog.h b/src/include/access/generic_xlog.h
index 63f2120..a8ecd35 100644
--- a/src/include/access/generic_xlog.h
+++ b/src/include/access/generic_xlog.h
@@ -40,5 +40,6 @@ extern void GenericXLogAbort(GenericXLogState *state);
 extern void generic_redo(XLogReaderState *record);
 extern const char *generic_identify(uint8 info);
 extern void generic_desc(StringInfo buf, XLogReaderState *record);
+extern void generic_checkConsistency(XLogReaderState *record);
 
 #endif /* GENERIC_XLOG_H */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index e5b2e10..c5e80fd 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -80,4 +80,5 @@ extern const char *gin_identify(uint8 info);
 extern void gin_xlog_startup(void);
 extern void gin_xlog_cleanup(void);
 
+extern void gin_checkConsistency(XLogReaderState *record);
 #endif /* GIN_H */
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 1231585..3ad246b 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -464,6 +464,7 @@ extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
 extern void gist_xlog_startup(void);
 extern void gist_xlog_cleanup(void);
+extern void gist_checkConsistency(XLogReaderState *record);
 
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
                OffsetNumber *todelete, int ntodelete,
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 5f941a9..28f8aca 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -21,5 +21,6 @@
 extern void hash_redo(XLogReaderState *record);
 extern void hash_desc(StringInfo buf, XLogReaderState *record);
 extern const char *hash_identify(uint8 info);
+extern void hash_checkConsistency(XLogReaderState *record);
 
 #endif /* HASH_XLOG_H */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 06a8242..c52e27c 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -398,4 +398,5 @@ extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
 extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
                  Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
 
+extern void heap_checkConsistency(XLogReaderState *record);
 #endif /* HEAPAM_XLOG_H */
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c580f51..8e5f1fc 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -776,4 +776,5 @@ extern void btree_redo(XLogReaderState *record);
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
 extern const char *btree_identify(uint8 info);
 
+extern void btree_checkConsistency(XLogReaderState *record);
 #endif /* NBTREE_H */
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index ff7fe62..3e6d014 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -19,7 +19,7 @@ typedef uint8 RmgrId;
  * Note: RM_MAX_ID must fit in RmgrId; widening that type will affect the XLOG
  * file format.
  */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,checkConsistency) \
     symname,
 
 typedef enum RmgrIds
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index a7a0ae2..9ff80f3 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -25,25 +25,25 @@
  */
 
 /* symbol name, textual name, redo, desc, identify, startup, cleanup */
-PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL)
-PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL)
-PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL)
-PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL)
-PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL)
-PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL)
-PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL)
-PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL)
-PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL)
-PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL)
-PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL)
-PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup)
-PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup)
-PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
-PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
-PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
-PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
-PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
-PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL)
-PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL)
+PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, xlog_checkConsistency)
+PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL)
+PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL)
+PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL)
+PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL)
+PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL)
+PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL)
+PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
+PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_checkConsistency)
+PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_checkConsistency)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_checkConsistency)
+PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_checkConsistency)
+PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_checkConsistency)
+PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_checkConsistency)
+PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_checkConsistency)
+PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_checkConsistency)
+PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_checkConsistency)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL)
+PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
+PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_checkConsistency)
+PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index a953a5a..edd224c 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -220,5 +220,6 @@ extern void spg_desc(StringInfo buf, XLogReaderState *record);
 extern const char *spg_identify(uint8 info);
 extern void spg_xlog_startup(void);
 extern void spg_xlog_cleanup(void);
+extern void spg_checkConsistency(XLogReaderState *record);
 
 #endif /* SPGIST_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index c9f332c..d19b9ec 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -105,6 +105,8 @@ extern bool EnableHotStandby;
 extern bool fullPageWrites;
 extern bool wal_log_hints;
 extern bool wal_compression;
+extern bool *wal_consistency;
+extern char *wal_consistency_string;
 extern bool log_checkpoints;
 
 extern int CheckPointSegments;
@@ -274,6 +276,8 @@ extern void XLogRequestWalReceiverReply(void);
 extern void assign_max_wal_size(int newval, void *extra);
 extern void assign_checkpoint_completion_target(double newval, void *extra);
 
+extern void xlog_checkConsistency(XLogReaderState *record);
+
 /*
  * Starting/stopping a base backup
  */
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 0a595cc..e9d210f 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -276,6 +276,7 @@ typedef struct RmgrData
     const char *(*rm_identify) (uint8 info);
     void (*rm_startup) (void);
     void (*rm_cleanup) (void);
+    void (*rm_checkConsistency) (XLogReaderState *record);
 } RmgrData;
 
 extern const RmgrData RmgrTable[];
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index deaa7f5..287143b 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -52,6 +52,8 @@ typedef struct
 
     /* Information on full-page image, if any */
     bool has_image;
+    bool require_image; /* This field contains the true value of has_image.
+            Because, if wal consistency check is enabled, has_image will always be true.*/
     char *bkp_image;
     uint16 hole_offset;
     uint16 hole_length;
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 3dfcb49..34e28c0 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -137,7 +137,7 @@ typedef struct XLogRecordBlockImageHeader
 /* Information stored in bimg_info */
 #define BKPIMAGE_HAS_HOLE 0x01 /* page image has "hole" */
 #define BKPIMAGE_IS_COMPRESSED 0x02 /* page image is compressed */
-
+#define BKPIMAGE_IS_REQUIRED 0x04 /* page is required by the WAL record */
 /*
  * Extra header information used when page image has "hole" and
  * is compressed.
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index 6af60d8..26895fc 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -20,6 +20,19 @@
 #include "nodes/parsenodes.h"
 #include "storage/relfilenode.h"
 
+/*
+ * Page opaque data in a sequence page
+ */
+typedef struct SequencePageOpaqueData
+{
+    uint32 seq_page_id;
+} SequencePageOpaqueData;
+
+/*
+ * This page ID is for the convenience of being able to identify whether a
+ * page is being used by a sequence.
+ */
+#define SEQ_MAGIC 0x1717
 
 typedef struct FormData_pg_sequence
 {
@@ -81,5 +94,6 @@ extern void ResetSequenceCaches(void);
 extern void seq_redo(XLogReaderState *rptr);
 extern void seq_desc(StringInfo buf, XLogReaderState *rptr);
 extern const char *seq_identify(uint8 info);
+extern void seq_checkConsistency(XLogReaderState *record);
 
 #endif /* SEQUENCE_H */
diff --git a/src/include/storage/bufmask.h b/src/include/storage/bufmask.h
new file mode 100644
index 0000000..b8d850a
--- /dev/null
+++ b/src/include/storage/bufmask.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmask.h
+ *    Buffer masking definitions.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bufmask.h
+ */
+
+#ifndef BUFMASK_H
+#define BUFMASK_H
+
+#include "postgres.h"
+#include "storage/block.h"
+
+/* Entry point for page masking */
+extern char *mask_page(RmgrIds rmid, uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_heap_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_btree_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_hash_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_spg_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_gist_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_gin_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_brin_page(uint8 info, BlockNumber blkno, const char *page);
+extern char *mask_common_page(uint8 info, BlockNumber blkno, const char *page, bool maskHints, bool maskUnusedSpace);
+#endif
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 15cebfc..b754134 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -432,4 +432,5 @@ extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
 extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
 extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
 
+extern int comparePages(Page norm_new_page, Page norm_old_page);
 #endif /* BUFPAGE_H */
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index fd71095..3050dd8 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -8,6 +8,10 @@ use Test::More tests => 4;
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(allows_streaming => 1);
+$node_master->append_conf(
+    'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_master->start;
 my $backup_name = 'my_backup';
 
@@ -18,6 +22,10 @@ $node_master->backup($backup_name);
 my $node_standby_1 = get_new_node('standby_1');
 $node_standby_1->init_from_backup($node_master, $backup_name,
     has_streaming => 1);
+$node_standby_1->append_conf(
+    'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_standby_1->start;
 
 # Take backup of standby 1 (not mandatory, but useful to check if
@@ -28,6 +36,10 @@ $node_standby_1->backup($backup_name);
 my $node_standby_2 = get_new_node('standby_2');
 $node_standby_2->init_from_backup($node_standby_1, $backup_name,
     has_streaming => 1);
+$node_standby_2->append_conf(
+    'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_standby_2->start;
 
 # Create some content on master and check its presence in standby 1
diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
index fc2bf7e..ed9da1d 100644
--- a/src/test/recovery/t/002_archiving.pl
+++ b/src/test/recovery/t/002_archiving.pl
@@ -11,6 +11,10 @@ my $node_master = get_new_node('master');
 $node_master->init(
     has_archiving => 1,
     allows_streaming => 1);
+$node_master->append_conf(
+    'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 my $backup_name = 'my_backup';
 
 # Start it
@@ -27,6 +31,10 @@ $node_standby->append_conf(
     'postgresql.conf', qq(
 wal_retrieve_retry_interval = '100ms'
 ));
+$node_standby->append_conf(
+    'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_standby->start;
 
 # Create some content on master
diff --git a/src/test/recovery/t/003_recovery_targets.pl b/src/test/recovery/t/003_recovery_targets.pl
index a82545b..6452086 100644
--- a/src/test/recovery/t/003_recovery_targets.pl
+++ b/src/test/recovery/t/003_recovery_targets.pl
@@ -27,7 +27,10 @@ sub test_recovery_standby
             qq($param_item
 ));
     }
-
+    $node_standby->append_conf(
+        'postgresql.conf', qq(
+    wal_consistency = 'All'
+    ));
     $node_standby->start;
 
     # Wait until standby has replayed enough data
@@ -48,7 +51,10 @@ sub test_recovery_standby
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(has_archiving => 1, allows_streaming => 1);
-
+$node_master->append_conf(
+    'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 # Start it
 $node_master->start;
 
diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl
index 3ee8df2..42c4257 100644
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ b/src/test/recovery/t/004_timeline_switch.pl
@@ -13,6 +13,10 @@ $ENV{PGDATABASE} = 'postgres';
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(allows_streaming => 1);
+$node_master->append_conf(
+    'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_master->start;
 
 # Take backup
@@ -23,10 +27,18 @@ $node_master->backup($backup_name);
 my $node_standby_1 = get_new_node('standby_1');
 $node_standby_1->init_from_backup($node_master, $backup_name,
     has_streaming => 1);
+$node_standby_1->append_conf(
+    'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_standby_1->start;
 my $node_standby_2 = get_new_node('standby_2');
 $node_standby_2->init_from_backup($node_master, $backup_name,
     has_streaming => 1);
+$node_standby_2->append_conf(
+    'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_standby_2->start;
 
 # Create some content on master
diff --git a/src/test/recovery/t/005_replay_delay.pl b/src/test/recovery/t/005_replay_delay.pl
index 640295b..b782cc2 100644
--- a/src/test/recovery/t/005_replay_delay.pl
+++ b/src/test/recovery/t/005_replay_delay.pl
@@ -9,6 +9,10 @@ use Test::More tests => 1;
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(allows_streaming => 1);
+$node_master->append_conf(
+    'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_master->start;
 
 # And some content
@@ -28,6 +32,10 @@ $node_standby->append_conf(
     'recovery.conf', qq(
 recovery_min_apply_delay = '${delay}s'
 ));
+$node_standby->append_conf(
+    'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_standby->start;
 
 # Make new content on master and check its presence in standby depending
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index b80a9a9..63a10c4 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -13,6 +13,10 @@ $node_master->append_conf(
 max_replication_slots = 4
 wal_level = logical
 ));
+$node_master->append_conf(
+    'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_master->start;
 my $backup_name = 'master_backup';
 
diff --git a/src/test/recovery/t/007_sync_rep.pl b/src/test/recovery/t/007_sync_rep.pl
index 0c87226..5911d65 100644
--- a/src/test/recovery/t/007_sync_rep.pl
+++ b/src/test/recovery/t/007_sync_rep.pl
@@ -46,6 +46,10 @@ sub test_sync_state
 # Initialize master node
 my $node_master = get_new_node('master');
 $node_master->init(allows_streaming => 1);
+$node_master->append_conf(
+    'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_master->start;
 my $backup_name = 'master_backup';
 
@@ -56,18 +60,30 @@ $node_master->backup($backup_name);
 my $node_standby_1 = get_new_node('standby1');
 $node_standby_1->init_from_backup($node_master, $backup_name,
     has_streaming => 1);
+$node_standby_1->append_conf(
+    'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_standby_1->start;
 
 # Create standby2 linking to master
 my $node_standby_2 = get_new_node('standby2');
 $node_standby_2->init_from_backup($node_master, $backup_name,
     has_streaming => 1);
+$node_standby_2->append_conf(
+    'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_standby_2->start;
 
 # Create standby3 linking to master
 my $node_standby_3 = get_new_node('standby3');
 $node_standby_3->init_from_backup($node_master, $backup_name,
     has_streaming => 1);
+$node_standby_3->append_conf(
+    'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_standby_3->start;
 
 # Check that sync_state is determined correctly when
@@ -116,6 +132,10 @@ $node_standby_1->start;
 my $node_standby_4 = get_new_node('standby4');
 $node_standby_4->init_from_backup($node_master, $backup_name,
     has_streaming => 1);
+$node_standby_4->append_conf(
+    'postgresql.conf', qq(
+wal_consistency = 'All'
+));
 $node_standby_4->start;
 
 # Check that standby1 and standby2 whose names appear earlier in
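
For readers who want the mask-then-compare idea from the description at the top of this mail in isolation, here is a rough standalone sketch. It is not the code in the patch: SketchPageHeader, sketch_mask_page() and sketch_pages_consistent() are hypothetical names, the header layout is a simplified stand-in rather than PageHeaderData, and which fields get masked is an assumption made only for illustration. The idea is that both the backup image and the replayed buffer are copied, the fields that redo is not expected to reproduce (LSN, checksum, hint flags, the free-space hole) are zeroed in the copies, and only then are the copies compared byte for byte.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 8192   /* assumed page size for this sketch */

/* Hypothetical, simplified page header; NOT PostgreSQL's PageHeaderData. */
typedef struct SketchPageHeader
{
    uint64_t lsn;       /* WAL position of the last change to the page */
    uint16_t checksum;  /* page checksum */
    uint16_t flags;     /* assumed to hold hint bits only */
    uint16_t lower;     /* offset of the start of free space */
    uint16_t upper;     /* offset of the end of free space */
} SketchPageHeader;

/*
 * Return a malloc'd copy of the page with the volatile parts zeroed,
 * or NULL if memory could not be allocated. The caller frees the copy.
 */
char *
sketch_mask_page(const char *page)
{
    SketchPageHeader hdr;
    char *copy;

    assert(page != NULL);

    copy = malloc(SKETCH_PAGE_SIZE);
    if (copy == NULL)
        return NULL;
    memcpy(copy, page, SKETCH_PAGE_SIZE);
    memcpy(&hdr, copy, sizeof(hdr));

    /* Zero the free-space hole, but only if the offsets look sane. */
    if (hdr.lower >= sizeof(hdr) &&
        hdr.lower <= hdr.upper &&
        hdr.upper <= SKETCH_PAGE_SIZE)
        memset(copy + hdr.lower, 0, (size_t) (hdr.upper - hdr.lower));

    /* Zero the header fields that may legitimately differ after redo. */
    hdr.lsn = 0;
    hdr.checksum = 0;
    hdr.flags = 0;
    memcpy(copy, &hdr, sizeof(hdr));

    return copy;
}

/*
 * Compare two pages after masking.
 * Returns 1 if they match, 0 if they differ, -1 on allocation failure.
 */
int
sketch_pages_consistent(const char *backup_page, const char *replayed_page)
{
    char *a = sketch_mask_page(backup_page);
    char *b = sketch_mask_page(replayed_page);
    int result = -1;

    if (a != NULL && b != NULL)
        result = (memcmp(a, b, SKETCH_PAGE_SIZE) == 0) ? 1 : 0;

    free(a);
    free(b);
    return result;
}
```

Masking copies rather than the pages themselves keeps the check free of side effects on the buffer being replayed. A per-rmgr masking routine would additionally have to hide whatever access-method-specific bits are allowed to differ.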