I have a suggestion for a table checksumming facility within PostgreSQL. The applications are reasonably obvious: detecting changes to tables, validating data migrations, unit testing, and so on. A possible algorithm is as follows:
1. For each row of the table, take the binary representations of the values and serialise them to CSV.
2. Calculate the MD5 sum of each CSV-serialised row.
3. XOR the row MD5 sums together.
4. CSV-serialise and MD5 a list of representations (of some sort) of the types of the table's columns, and XOR it with the rest.
5. Output the result as the table's checksum.

Advantages of this approach:

1. Easily implemented using SPI.
2. Since XOR is commutative and associative, the order in which rows are ingested doesn't matter; unlike some other table checksumming methods, this doesn't need an expensive ORDER BY *. It should therefore be roughly as fast as a plain SELECT * over the table, which is probably as fast as a table checksum can be.
3. Using a cursor in SPI, rows can be ingested a few at a time, so the memory footprint is low even for large tables.
4. The output has a convenient fixed size of 128 bits.

Questions:

1. Should this be a contrib module which provides a function, or should it be a built-in piece of functionality?
2. Is MD5 too heavyweight for this? Would using a non-cryptographic checksum be worth the speed boost?
3. Is there a risk of different architectures/versions returning different checksums for tables which could be considered identical? If so, is this worth worrying about?

I have knocked up some sample code if anyone is interested.

Regards,
Matthew
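P.S. For the curious, here is a rough, untested PL/pgSQL sketch of the idea. The real implementation would be a C function using SPI and would hash the binary-to-CSV serialisation rather than the row's text output used here; the function name table_checksum and the pg_attribute query for the column-type hash are purely illustrative.

CREATE OR REPLACE FUNCTION table_checksum(tbl regclass) RETURNS bit(128) AS $$
DECLARE
    acc bit(128) := repeat('0', 128)::bit(128);
    rec record;
BEGIN
    -- Steps 1-3: hash each row and XOR the hashes together.
    -- (The row's text output stands in for the proposed binary-to-CSV
    -- serialisation; # is bitwise XOR on bit strings.)
    FOR rec IN EXECUTE format('SELECT t::text AS row_text FROM %s t', tbl) LOOP
        acc := acc # ('x' || md5(rec.row_text))::bit(128);
    END LOOP;

    -- Step 4: fold in a hash of the column types, so two tables with the
    -- same data but different layouts checksum differently.
    SELECT acc # ('x' || md5(string_agg(a.atttypid::regtype::text, ','
                                        ORDER BY a.attnum)))::bit(128)
      INTO acc
      FROM pg_attribute a
     WHERE a.attrelid = tbl AND a.attnum > 0 AND NOT a.attisdropped;

    -- Step 5: the accumulated 128-bit value is the table's checksum.
    RETURN acc;
END;
$$ LANGUAGE plpgsql;

Called as SELECT table_checksum('my_table'), it returns the same 128-bit value regardless of the order in which the rows are scanned, since XOR is order-independent.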