Attached is a first draft of an update to pg_filedump for 9.3. I know pg_filedump is a pgfoundry project, but that seems to exist only to host the download, so please excuse the slightly off-topic post here on -hackers.
I made a few changes to support 9.3, mostly fixes related to two things:

* new htup_details.h and changes related to FK concurrency improvements
* XLogRecPtr is now a uint64

And, of course, I added support for checksums. The stored checksum is always displayed, but it is only verified, and an error reported on a mismatch, if you pass "-k". Only the user knows whether checksums are enabled, because we removed the page-level bits indicating the presence of a checksum.

The patch is a bit ugly: I had to copy some code, and copy the entire checksum.c file (minus some Asserts, which don't work in an external program).

Suggestions welcome.

Regards,
Jeff Davis
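For readers who have not hit the XLogRecPtr change yet, here is a minimal sketch of what "XLogRecPtr is now a uint64" means for display code: the old xlogid/xrecoff struct members are gone, so both 32-bit halves have to be extracted explicitly, and forgetting the shift prints the low word twice. The names demo_lsn_hi, demo_lsn_lo and demo_format_lsn are hypothetical helpers made up for this illustration and are not part of pg_filedump or PostgreSQL; the "%X/%X" rendering is the conventional way PostgreSQL prints an LSN.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* High 32 bits of a 64-bit LSN (what used to be the xlogid member). */
static uint32_t
demo_lsn_hi(uint64_t lsn)
{
	return (uint32_t) (lsn >> 32);
}

/* Low 32 bits of a 64-bit LSN (what used to be the xrecoff member). */
static uint32_t
demo_lsn_lo(uint64_t lsn)
{
	return (uint32_t) lsn;
}

/*
 * Render an LSN as "hi/lo" in upper-case hex.  Returns the number of
 * characters snprintf would have written, excluding the terminating NUL.
 */
static int
demo_format_lsn(char *buf, size_t buflen, uint64_t lsn)
{
	assert(buf != NULL && buflen > 0);
	return snprintf(buf, buflen, "%" PRIX32 "/%" PRIX32,
					demo_lsn_hi(lsn), demo_lsn_lo(lsn));
}
```

Calling demo_format_lsn with a 32-byte buffer is more than enough, since the longest possible result is 17 characters plus the terminator.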
diff -Nc pg_filedump-9.2.0/checksum.c pg_filedump-9.3.0j/checksum.c *** pg_filedump-9.2.0/checksum.c 1969-12-31 16:00:00.000000000 -0800 --- pg_filedump-9.3.0j/checksum.c 2013-06-09 21:20:34.036176831 -0700 *************** *** 0 **** --- 1,157 ---- + /*------------------------------------------------------------------------- + * + * checksum.c + * Checksum implementation for data pages. + * + * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/storage/page/checksum.c + * + *------------------------------------------------------------------------- + * + * Checksum algorithm + * + * The algorithm used to checksum pages is chosen for very fast calculation. + * Workloads where the database working set fits into OS file cache but not + * into shared buffers can read in pages at a very fast pace and the checksum + * algorithm itself can become the largest bottleneck. + * + * The checksum algorithm itself is based on the FNV-1a hash (FNV is shorthand + * for Fowler/Noll/Vo) The primitive of a plain FNV-1a hash folds in data 1 + * byte at a time according to the formula: + * + * hash = (hash ^ value) * FNV_PRIME + * + * FNV-1a algorithm is described at http://www.isthe.com/chongo/tech/comp/fnv/ + * + * PostgreSQL doesn't use FNV-1a hash directly because it has bad mixing of + * high bits - high order bits in input data only affect high order bits in + * output data. To resolve this we xor in the value prior to multiplication + * shifted right by 17 bits. The number 17 was chosen because it doesn't + * have common denominator with set bit positions in FNV_PRIME and empirically + * provides the fastest mixing for high order bits of final iterations quickly + * avalanche into lower positions. For performance reasons we choose to combine + * 4 bytes at a time. 
The actual hash formula used as the basis is: + * + * hash = (hash ^ value) * FNV_PRIME ^ ((hash ^ value) >> 17) + * + * The main bottleneck in this calculation is the multiplication latency. To + * hide the latency and to make use of SIMD parallelism multiple hash values + * are calculated in parallel. The page is treated as a 32 column two + * dimensional array of 32 bit values. Each column is aggregated separately + * into a partial checksum. Each partial checksum uses a different initial + * value (offset basis in FNV terminology). The initial values actually used + * were chosen randomly, as the values themselves don't matter as much as that + * they are different and don't match anything in real data. After initializing + * partial checksums each value in the column is aggregated according to the + * above formula. Finally two more iterations of the formula are performed with + * value 0 to mix the bits of the last value added. + * + * The partial checksums are then folded together using xor to form a single + * 32-bit checksum. The caller can safely reduce the value to 16 bits + * using modulo 2^16-1. That will cause a very slight bias towards lower + * values but this is not significant for the performance of the + * checksum. + * + * The algorithm choice was based on what instructions are available in SIMD + * instruction sets. This meant that a fast and good algorithm needed to use + * multiplication as the main mixing operator. The simplest multiplication + * based checksum primitive is the one used by FNV. The prime used is chosen + * for good dispersion of values. It has no known simple patterns that result + * in collisions. Test of 5-bit differentials of the primitive over 64bit keys + * reveals no differentials with 3 or more values out of 100000 random keys + * colliding. Avalanche test shows that only high order bits of the last word + * have a bias. 
Tests of 1-4 uncorrelated bit errors, stray 0 and 0xFF bytes, + * overwriting page from random position to end with 0 bytes, and overwriting + * random segments of page with 0x00, 0xFF and random data all show optimal + * 2e-16 false positive rate within margin of error. + * + * Vectorization of the algorithm requires 32bit x 32bit -> 32bit integer + * multiplication instruction. As of 2013 the corresponding instruction is + * available on x86 SSE4.1 extensions (pmulld) and ARM NEON (vmul.i32). + * Vectorization requires a compiler to do the vectorization for us. For recent + * GCC versions the flags -msse4.1 -funroll-loops -ftree-vectorize are enough + * to achieve vectorization. + * + * The optimal amount of parallelism to use depends on CPU specific instruction + * latency, SIMD instruction width, throughput and the amount of registers + * available to hold intermediate state. Generally, more parallelism is better + * up to the point that state doesn't fit in registers and extra load-store + * instructions are needed to swap values in/out. The number chosen is a fixed + * part of the algorithm because changing the parallelism changes the checksum + * result. + * + * The parallelism number 32 was chosen based on the fact that it is the + * largest state that fits into architecturally visible x86 SSE registers while + * leaving some free registers for intermediate values. For future processors + * with 256bit vector registers this will leave some performance on the table. + * When vectorization is not available it might be beneficial to restructure + * the computation to calculate a subset of the columns at a time and perform + * multiple passes to avoid register spilling. This optimization opportunity + * is not used. Current coding also assumes that the compiler has the ability + * to unroll the inner loop to avoid loop overhead and minimize register + * spilling. For less sophisticated compilers it might be beneficial to manually + * unroll the inner loop. 
+ */ + #include "postgres.h" + + #include "storage/checksum.h" + + /* number of checksums to calculate in parallel */ + #define N_SUMS 32 + /* prime multiplier of FNV-1a hash */ + #define FNV_PRIME 16777619 + + /* + * Base offsets to initialize each of the parallel FNV hashes into a + * different initial state. + */ + static const uint32 checksumBaseOffsets[N_SUMS] = { + 0x5B1F36E9, 0xB8525960, 0x02AB50AA, 0x1DE66D2A, + 0x79FF467A, 0x9BB9F8A3, 0x217E7CD2, 0x83E13D2C, + 0xF8D4474F, 0xE39EB970, 0x42C6AE16, 0x993216FA, + 0x7B093B5D, 0x98DAFF3C, 0xF718902A, 0x0B1C9CDB, + 0xE58F764B, 0x187636BC, 0x5D7B3BB1, 0xE73DE7DE, + 0x92BEC979, 0xCCA6C0B2, 0x304A0979, 0x85AA43D4, + 0x783125BB, 0x6CA8EAA2, 0xE407EAC6, 0x4B5CFC3E, + 0x9FBF8C76, 0x15CA20BE, 0xF2CA9FD3, 0x959BD756 + }; + + /* + * Calculate one round of the checksum. + */ + #define CHECKSUM_COMP(checksum, value) do {\ + uint32 __tmp = (checksum) ^ (value);\ + (checksum) = __tmp * FNV_PRIME ^ (__tmp >> 17);\ + } while (0) + + uint32 + checksum_block(char *data, uint32 size) + { + uint32 sums[N_SUMS]; + uint32 (*dataArr)[N_SUMS] = (uint32 (*)[N_SUMS]) data; + uint32 result = 0; + int i, j; + + /* initialize partial checksums to their corresponding offsets */ + memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets)); + + /* main checksum calculation */ + for (i = 0; i < size/sizeof(uint32)/N_SUMS; i++) + for (j = 0; j < N_SUMS; j++) + CHECKSUM_COMP(sums[j], dataArr[i][j]); + + /* finally add in two rounds of zeroes for additional mixing */ + for (i = 0; i < 2; i++) + for (j = 0; j < N_SUMS; j++) + CHECKSUM_COMP(sums[j], 0); + + /* xor fold partial checksums together */ + for (i = 0; i < N_SUMS; i++) + result ^= sums[i]; + + return result; + } Common subdirectories: pg_filedump-9.2.0/.deps and pg_filedump-9.3.0j/.deps diff -Nc pg_filedump-9.2.0/Makefile pg_filedump-9.3.0j/Makefile *** pg_filedump-9.2.0/Makefile 2012-03-12 09:02:44.000000000 -0700 --- pg_filedump-9.3.0j/Makefile 2013-06-09 21:15:43.908182347 -0700 
*************** *** 1,7 **** # View README.pg_filedump first # note this must match version macros in pg_filedump.h ! FD_VERSION=9.2.0 CC=gcc CFLAGS=-g -O -Wall -Wmissing-prototypes -Wmissing-declarations --- 1,7 ---- # View README.pg_filedump first # note this must match version macros in pg_filedump.h ! FD_VERSION=9.3.0 CC=gcc CFLAGS=-g -O -Wall -Wmissing-prototypes -Wmissing-declarations *************** *** 17,28 **** all: pg_filedump ! pg_filedump: pg_filedump.o ${CC} ${CFLAGS} -o pg_filedump pg_filedump.o pg_filedump.o: pg_filedump.c ${CC} ${CFLAGS} -I${PGSQL_INCLUDE_DIR} pg_filedump.c -c dist: rm -rf pg_filedump-${FD_VERSION} pg_filedump-${FD_VERSION}.tar.gz mkdir pg_filedump-${FD_VERSION} --- 17,31 ---- all: pg_filedump ! pg_filedump: pg_filedump.o checksum.o ${CC} ${CFLAGS} -o pg_filedump pg_filedump.o pg_filedump.o: pg_filedump.c ${CC} ${CFLAGS} -I${PGSQL_INCLUDE_DIR} pg_filedump.c -c + checksum.o: checksum.c + ${CC} ${CFLAGS} -I${PGSQL_INCLUDE_DIR} checksum.c -c + dist: rm -rf pg_filedump-${FD_VERSION} pg_filedump-${FD_VERSION}.tar.gz mkdir pg_filedump-${FD_VERSION} diff -Nc pg_filedump-9.2.0/Makefile.contrib pg_filedump-9.3.0j/Makefile.contrib *** pg_filedump-9.2.0/Makefile.contrib 2012-03-12 08:52:57.000000000 -0700 --- pg_filedump-9.3.0j/Makefile.contrib 2013-06-09 21:16:17.524181706 -0700 *************** *** 1,5 **** PROGRAM = pg_filedump ! OBJS = pg_filedump.o DOCS = README.pg_filedump --- 1,5 ---- PROGRAM = pg_filedump ! OBJS = pg_filedump.o checksum.o DOCS = README.pg_filedump diff -Nc pg_filedump-9.2.0/pg_filedump.c pg_filedump-9.3.0j/pg_filedump.c *** pg_filedump-9.2.0/pg_filedump.c 2012-03-12 08:58:31.000000000 -0700 --- pg_filedump-9.3.0j/pg_filedump.c 2013-06-09 21:46:21.240147414 -0700 *************** *** 40,51 **** static void DisplayOptions (unsigned int validOptions); static unsigned int ConsumeOptions (int numOptions, char **options); static int GetOptionValue (char *optionString); ! 
static void FormatBlock (); static unsigned int GetBlockSize (); static unsigned int GetSpecialSectionType (Page page); static bool IsBtreeMetaPage(Page page); static void CreateDumpFileHeader (int numOptions, char **options); ! static int FormatHeader (Page page); static void FormatItemBlock (Page page); static void FormatItem (unsigned int numBytes, unsigned int startIndex, unsigned int formatAs); --- 40,51 ---- static void DisplayOptions (unsigned int validOptions); static unsigned int ConsumeOptions (int numOptions, char **options); static int GetOptionValue (char *optionString); ! static void FormatBlock (BlockNumber blkno); static unsigned int GetBlockSize (); static unsigned int GetSpecialSectionType (Page page); static bool IsBtreeMetaPage(Page page); static void CreateDumpFileHeader (int numOptions, char **options); ! static int FormatHeader (Page page, BlockNumber blkno); static void FormatItemBlock (Page page); static void FormatItem (unsigned int numBytes, unsigned int startIndex, unsigned int formatAs); *************** *** 54,60 **** static void FormatBinary (unsigned int numBytes, unsigned int startIndex); static void DumpBinaryBlock (); static void DumpFileContents (); ! // Send properly formed usage information to the user. static void --- 54,60 ---- static void FormatBinary (unsigned int numBytes, unsigned int startIndex); static void DumpBinaryBlock (); static void DumpFileContents (); ! static uint16 PageCalcChecksum16 (Page page, BlockNumber blkno); // Send properly formed usage information to the user. 
static void *************** *** 288,293 **** --- 288,298 ---- SET_OPTION (itemOptions, ITEM_DETAIL, 'i'); break; + // Verify block checksums + case 'k': + SET_OPTION (blockOptions, BLOCK_CHECKSUMS, 'k'); + break; + // Interpret items as standard index values case 'x': SET_OPTION (itemOptions, ITEM_INDEX, 'x'); *************** *** 522,527 **** --- 527,561 ---- return false; } + static uint16 + PageCalcChecksum16(Page page, BlockNumber blkno) + { + PageHeader phdr = (PageHeader) page; + uint16 save_checksum; + uint32 checksum; + + /* + * Save pd_checksum and set it to zero, so that the checksum calculation + * isn't affected by the checksum stored on the page. We do this to + * allow optimization of the checksum calculation on the whole block + * in one go. + */ + save_checksum = phdr->pd_checksum; + phdr->pd_checksum = 0; + checksum = checksum_block(page, BLCKSZ); + phdr->pd_checksum = save_checksum; + + /* mix in the block number to detect transposed pages */ + checksum ^= blkno; + + /* + * Reduce to a uint16 (to fit in the pd_checksum field) with an offset of + * one. That avoids checksums of zero, which seems like a good idea. + */ + return (checksum % 65535) + 1; + } + + // Display a header for the dump so we know the file name, the options // used and the time the dump was taken static void *************** *** 555,561 **** // Dump out a formatted block header for the requested block static int ! FormatHeader (Page page) { int rc = 0; unsigned int headerBytes; --- 589,595 ---- // Dump out a formatted block header for the requested block static int ! FormatHeader (Page page, BlockNumber blkno) { int rc = 0; unsigned int headerBytes; *************** *** 609,623 **** " Block: Size %4d Version %4u Upper %4u (0x%04hx)\n" " LSN: logid %6d recoff 0x%08x Special %4u (0x%04hx)\n" " Items: %4d Free Space: %4u\n" ! 
" TLI: 0x%04x Prune XID: 0x%08x Flags: 0x%04x (%s)\n" " Length (including item array): %u\n\n", pageOffset, pageHeader->pd_lower, pageHeader->pd_lower, (int) PageGetPageSize (page), blockVersion, pageHeader->pd_upper, pageHeader->pd_upper, ! pageLSN.xlogid, pageLSN.xrecoff, pageHeader->pd_special, pageHeader->pd_special, maxOffset, pageHeader->pd_upper - pageHeader->pd_lower, ! pageHeader->pd_tli, pageHeader->pd_prune_xid, pageHeader->pd_flags, flagString, headerBytes); --- 643,657 ---- " Block: Size %4d Version %4u Upper %4u (0x%04hx)\n" " LSN: logid %6d recoff 0x%08x Special %4u (0x%04hx)\n" " Items: %4d Free Space: %4u\n" ! " Checksum: %05hu Prune XID: 0x%08x Flags: 0x%04x (%s)\n" " Length (including item array): %u\n\n", pageOffset, pageHeader->pd_lower, pageHeader->pd_lower, (int) PageGetPageSize (page), blockVersion, pageHeader->pd_upper, pageHeader->pd_upper, ! (uint32) (pageLSN >> 32), (uint32) pageLSN, pageHeader->pd_special, pageHeader->pd_special, maxOffset, pageHeader->pd_upper - pageHeader->pd_lower, ! pageHeader->pd_checksum, pageHeader->pd_prune_xid, pageHeader->pd_flags, flagString, headerBytes); *************** *** 647,652 **** --- 681,694 ---- || (pageHeader->pd_upper < pageHeader->pd_lower) || (pageHeader->pd_special > blockSize)) printf (" Error: Invalid header information.\n\n"); + + if (blockOptions & BLOCK_CHECKSUMS) + { + uint16 calc_checksum = PageCalcChecksum16(page, blkno); + if (calc_checksum != pageHeader->pd_checksum) + printf(" Error: checksum failure: calculated %05hu.\n\n", + calc_checksum); + } } // If we have reached the end of file while interpreting the header, let *************** *** 933,939 **** printf (" XMIN: %u XMAX: %u CID|XVAC: %u", HeapTupleHeaderGetXmin(htup), ! HeapTupleHeaderGetXmax(htup), HeapTupleHeaderGetRawCommandId(htup)); if (infoMask & HEAP_HASOID) --- 975,981 ---- printf (" XMIN: %u XMAX: %u CID|XVAC: %u", HeapTupleHeaderGetXmin(htup), ! 
HeapTupleHeaderGetRawXmax(htup), HeapTupleHeaderGetRawCommandId(htup)); if (infoMask & HEAP_HASOID) *************** *** 958,969 **** strcat (flagString, "HASEXTERNAL|"); if (infoMask & HEAP_HASOID) strcat (flagString, "HASOID|"); if (infoMask & HEAP_COMBOCID) strcat (flagString, "COMBOCID|"); if (infoMask & HEAP_XMAX_EXCL_LOCK) strcat (flagString, "XMAX_EXCL_LOCK|"); ! if (infoMask & HEAP_XMAX_SHARED_LOCK) ! strcat (flagString, "XMAX_SHARED_LOCK|"); if (infoMask & HEAP_XMIN_COMMITTED) strcat (flagString, "XMIN_COMMITTED|"); if (infoMask & HEAP_XMIN_INVALID) --- 1000,1015 ---- strcat (flagString, "HASEXTERNAL|"); if (infoMask & HEAP_HASOID) strcat (flagString, "HASOID|"); + if (infoMask & HEAP_XMAX_KEYSHR_LOCK) + strcat (flagString, "XMAX_KEYSHR_LOCK|"); if (infoMask & HEAP_COMBOCID) strcat (flagString, "COMBOCID|"); if (infoMask & HEAP_XMAX_EXCL_LOCK) strcat (flagString, "XMAX_EXCL_LOCK|"); ! if (infoMask & HEAP_XMAX_SHR_LOCK) ! strcat (flagString, "XMAX_SHR_LOCK|"); ! if (infoMask & HEAP_XMAX_LOCK_ONLY) ! strcat (flagString, "XMAX_LOCK_ONLY|"); if (infoMask & HEAP_XMIN_COMMITTED) strcat (flagString, "XMIN_COMMITTED|"); if (infoMask & HEAP_XMIN_INVALID) *************** *** 981,986 **** --- 1027,1034 ---- if (infoMask & HEAP_MOVED_IN) strcat (flagString, "MOVED_IN|"); + if (infoMask2 & HEAP_KEYS_UPDATED) + strcat (flagString, "KEYS_UPDATED|"); if (infoMask2 & HEAP_HOT_UPDATED) strcat (flagString, "HOT_UPDATED|"); if (infoMask2 & HEAP_ONLY_TUPLE) *************** *** 1204,1210 **** // For each block, dump out formatted header and content information static void ! FormatBlock () { Page page = (Page) buffer; pageOffset = blockSize * currentBlock; --- 1252,1258 ---- // For each block, dump out formatted header and content information static void ! FormatBlock (BlockNumber blkno) { Page page = (Page) buffer; pageOffset = blockSize * currentBlock; *************** *** 1224,1230 **** int rc; // Every block contains a header, items and possibly a special // section. 
Beware of partial block reads though ! rc = FormatHeader (page); // If we didn't encounter a partial read in the header, carry on... if (rc != EOF_ENCOUNTERED) --- 1272,1278 ---- int rc; // Every block contains a header, items and possibly a special // section. Beware of partial block reads though ! rc = FormatHeader (page, blkno); // If we didn't encounter a partial read in the header, carry on... if (rc != EOF_ENCOUNTERED) *************** *** 1340,1354 **** controlData->system_identifier, dbState, ctime (&(cd_time)), ! controlData->checkPoint.xlogid, controlData->checkPoint.xrecoff, ! controlData->prevCheckPoint.xlogid, controlData->prevCheckPoint.xrecoff, ! checkPoint->redo.xlogid, checkPoint->redo.xrecoff, checkPoint->ThisTimeLineID, checkPoint->nextXidEpoch, checkPoint->nextXid, checkPoint->nextOid, checkPoint->nextMulti, checkPoint->nextMultiOffset, ctime (&cp_time), ! controlData->minRecoveryPoint.xlogid, controlData->minRecoveryPoint.xrecoff, controlData->maxAlign, controlData->floatFormat, (controlData->floatFormat == FLOATFORMAT_VALUE ? --- 1388,1402 ---- controlData->system_identifier, dbState, ctime (&(cd_time)), ! (uint32) (controlData->checkPoint >> 32), (uint32) controlData->checkPoint, ! (uint32) (controlData->prevCheckPoint >> 32), (uint32) controlData->prevCheckPoint, ! (uint32) (checkPoint->redo >> 32), (uint32) checkPoint->redo, checkPoint->ThisTimeLineID, checkPoint->nextXidEpoch, checkPoint->nextXid, checkPoint->nextOid, checkPoint->nextMulti, checkPoint->nextMultiOffset, ctime (&cp_time), ! (uint32) (controlData->minRecoveryPoint >> 32), (uint32) controlData->minRecoveryPoint, controlData->maxAlign, controlData->floatFormat, (controlData->floatFormat == FLOATFORMAT_VALUE ? *************** *** 1494,1500 **** contentsToDump = false; } else ! FormatBlock (); } } --- 1542,1548 ---- contentsToDump = false; } else ! 
FormatBlock (currentBlock); } } diff -Nc pg_filedump-9.2.0/pg_filedump.h pg_filedump-9.3.0j/pg_filedump.h *** pg_filedump-9.2.0/pg_filedump.h 2012-03-12 08:58:23.000000000 -0700 --- pg_filedump-9.3.0j/pg_filedump.h 2013-06-09 21:28:26.944167838 -0700 *************** *** 22,29 **** * Original Author: Patrick Macdonald <patri...@redhat.com> */ ! #define FD_VERSION "9.2.0" /* version ID of pg_filedump */ ! #define FD_PG_VERSION "PostgreSQL 9.2.x" /* PG version it works with */ #include "postgres.h" --- 22,29 ---- * Original Author: Patrick Macdonald <patri...@redhat.com> */ ! #define FD_VERSION "9.3.0" /* version ID of pg_filedump */ ! #define FD_PG_VERSION "PostgreSQL 9.3.x" /* PG version it works with */ #include "postgres.h" *************** *** 34,44 **** --- 34,46 ---- #include "access/gist.h" #include "access/hash.h" #include "access/htup.h" + #include "access/htup_details.h" #include "access/itup.h" #include "access/nbtree.h" #include "access/spgist_private.h" #include "catalog/pg_control.h" #include "storage/bufpage.h" + #include "storage/checksum.h" // Options for Block formatting operations static unsigned int blockOptions = 0; *************** *** 49,55 **** BLOCK_FORMAT = 0x00000004, // -f: Formatted dump of blocks / control file BLOCK_FORCED = 0x00000008, // -S: Block size forced BLOCK_NO_INTR = 0x00000010, // -d: Dump straight blocks ! BLOCK_RANGE = 0x00000020 // -R: Specific block range to dump } blockSwitches; --- 51,58 ---- BLOCK_FORMAT = 0x00000004, // -f: Formatted dump of blocks / control file BLOCK_FORCED = 0x00000008, // -S: Block size forced BLOCK_NO_INTR = 0x00000010, // -d: Dump straight blocks ! BLOCK_RANGE = 0x00000020, // -R: Specific block range to dump ! BLOCK_CHECKSUMS = 0x00000040 // -k: verify block checksums } blockSwitches;
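For anyone who wants to experiment with the checksum logic outside of pg_filedump, the following is a rough standalone sketch of what checksum_block and PageCalcChecksum16 in the patch do, with the postgres.h dependency removed. It is an illustration under stated assumptions, not a replacement for the copied checksum.c: every demo_ name is hypothetical, the block size is assumed to be the default 8192, and pd_checksum is assumed to sit at byte offset 8 of the page header (directly after the 8-byte pd_lsn). Words are read with memcpy in host byte order, which matches the pointer cast in the patch only when run on the same architecture that wrote the page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLCKSZ				8192	/* assumed default block size */
#define DEMO_PD_CHECKSUM_OFFSET	8		/* assumed offset of pd_checksum */
#define N_SUMS					32
#define FNV_PRIME				16777619

/* Same initial values as checksumBaseOffsets in the patch. */
static const uint32_t demo_base_offsets[N_SUMS] = {
	0x5B1F36E9, 0xB8525960, 0x02AB50AA, 0x1DE66D2A,
	0x79FF467A, 0x9BB9F8A3, 0x217E7CD2, 0x83E13D2C,
	0xF8D4474F, 0xE39EB970, 0x42C6AE16, 0x993216FA,
	0x7B093B5D, 0x98DAFF3C, 0xF718902A, 0x0B1C9CDB,
	0xE58F764B, 0x187636BC, 0x5D7B3BB1, 0xE73DE7DE,
	0x92BEC979, 0xCCA6C0B2, 0x304A0979, 0x85AA43D4,
	0x783125BB, 0x6CA8EAA2, 0xE407EAC6, 0x4B5CFC3E,
	0x9FBF8C76, 0x15CA20BE, 0xF2CA9FD3, 0x959BD756
};

/* One round: hash = (hash ^ value) * FNV_PRIME ^ ((hash ^ value) >> 17) */
static uint32_t
demo_round(uint32_t sum, uint32_t value)
{
	uint32_t	tmp = sum ^ value;

	return tmp * FNV_PRIME ^ (tmp >> 17);
}

/* 32 interleaved partial sums over the block, xor-folded at the end. */
static uint32_t
demo_checksum_block(const unsigned char *data, size_t size)
{
	uint32_t	sums[N_SUMS];
	uint32_t	result = 0;
	size_t		rows = size / sizeof(uint32_t) / N_SUMS;
	size_t		i, j;

	assert(size % (sizeof(uint32_t) * N_SUMS) == 0);
	memcpy(sums, demo_base_offsets, sizeof(sums));

	for (i = 0; i < rows; i++)
		for (j = 0; j < N_SUMS; j++)
		{
			uint32_t	value;

			memcpy(&value, data + (i * N_SUMS + j) * sizeof(value),
				   sizeof(value));
			sums[j] = demo_round(sums[j], value);
		}

	/* two extra rounds of zeroes to mix the last values added */
	for (i = 0; i < 2; i++)
		for (j = 0; j < N_SUMS; j++)
			sums[j] = demo_round(sums[j], 0);

	for (j = 0; j < N_SUMS; j++)
		result ^= sums[j];
	return result;
}

/* Zero the stored checksum, hash the page, restore it, mix in blkno. */
static uint16_t
demo_page_checksum(unsigned char *page, uint32_t blkno)
{
	unsigned char saved[2];
	uint32_t	checksum;

	memcpy(saved, page + DEMO_PD_CHECKSUM_OFFSET, sizeof(saved));
	memset(page + DEMO_PD_CHECKSUM_OFFSET, 0, sizeof(saved));
	checksum = demo_checksum_block(page, DEMO_BLCKSZ);
	memcpy(page + DEMO_PD_CHECKSUM_OFFSET, saved, sizeof(saved));

	checksum ^= blkno;
	return (uint16_t) ((checksum % 65535) + 1);	/* never zero */
}
```

To check a page by hand, read one DEMO_BLCKSZ-sized block into a buffer, call demo_page_checksum with the block's number within its segment file, and compare the result with the 16-bit value stored at DEMO_PD_CHECKSUM_OFFSET. Because the block number is mixed in, the same page contents yield a different checksum at a different block number, which is how transposed pages are detected.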