On Thu, Jul 24, 2014 at 3:35 AM, <m...@byrney.com> wrote:
> I have a suggestion for a table checksumming facility within PostgreSQL.
> The applications are reasonably obvious - detecting changes to tables,
> validating data migrations, unit testing, etc. A possible algorithm is as
> follows:
>
> 1. For each row of the table, take the binary representations of the
> values and serialise them to CSV.
> 2. Calculate the MD5 sum of each CSV-serialised row.
> 3. XOR the row MD5 sums together.
> 4. CSV-serialise and MD5 a list of representations (of some sort) of the
> types of the table's columns and XOR it with the rest.
> 5. Output the result as the table's checksum.
>
> Advantages of this approach:
>
> 1. Easily implemented using SPI.
> 2. Since XOR is commutative and associative, the order of ingestion of rows
> doesn't matter; therefore, unlike some other table checksumming methods,
> this doesn't need an expensive ORDER BY *. So this should be pretty much
> as fast as a SELECT * FROM, which is probably as fast as a table checksum
> can be.
> 3. Using a cursor in SPI, rows can be ingested a few at a time, so the
> memory footprint is low even for large tables.
> 4. The output has a convenient fixed size of 128 bits.
>
> Questions:
>
> 1. Should this be a contrib module which provides a function, or should it
> be a built-in piece of functionality?
> 2. Is MD5 too heavyweight for this? Would using a non-cryptographic
> checksum be worth the speed boost?
> 3. Is there a risk of different architectures/versions returning different
> checksums for tables which could be considered identical? If so, is this
> worth worrying about?
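
[The five steps of the proposal can be sketched as follows. This is a minimal illustration in Python, not the proposed SPI implementation: it serialises values via their textual form rather than PostgreSQL's binary representations, and the function names are invented here.]

```python
import csv
import hashlib
import io

def row_md5(values):
    """Steps 1-2: serialise one row's values to CSV, then MD5 the result."""
    buf = io.StringIO()
    csv.writer(buf).writerow(values)
    return hashlib.md5(buf.getvalue().encode()).digest()

def table_checksum(rows, column_types):
    """Steps 3-5: XOR the 16-byte per-row digests together, fold in a
    digest of the column-type list, and return the 128-bit result as hex."""
    acc = bytes(16)  # 128-bit zero accumulator
    for row in rows:
        acc = bytes(a ^ b for a, b in zip(acc, row_md5(row)))
    # Step 4: mix in the column types, so tables with equal data but
    # different column types checksum differently.
    acc = bytes(a ^ b for a, b in zip(acc, row_md5(column_types)))
    return acc.hex()
```

[Because XOR is commutative and associative, feeding the rows in any order yields the same checksum, which is the point of advantage 2.]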
Hmm - do you really think we need an extension for something that can be done with a query as simple as:

select md5(string_agg(md5(c::text), '' order by md5(c::text)))
from pg_class c;

(Of course you can do it on any table, not only pg_class.) If you want to use the XOR idea (which makes sense), all you need is to write a xor aggregate.

depesz
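
[To make the comparison concrete, here is a sketch in Python of what the query above computes versus what such a xor aggregate would compute. The function names are invented for illustration, and each row stands in for the textual form of a record; both checksums come out independent of row order, but the XOR variant avoids the sort inside the aggregate.]

```python
import hashlib

def md5_hex(s):
    """md5(...) over a row's textual form, as hex."""
    return hashlib.md5(s.encode()).hexdigest()

def sorted_agg_checksum(rows):
    """The query's approach: md5 of the per-row md5s concatenated in
    sorted order (string_agg(..., '' order by md5(c::text)))."""
    return md5_hex("".join(sorted(md5_hex(r) for r in rows)))

def xor_agg_checksum(rows):
    """A xor aggregate's approach: XOR the 128-bit per-row digests;
    no ordering is needed, since XOR is commutative and associative."""
    acc = 0
    for r in rows:
        acc ^= int(md5_hex(r), 16)
    return format(acc, "032x")
```
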