On Thu, Jul 24, 2014 at 3:35 AM, <m...@byrney.com> wrote:

> I have a suggestion for a table checksumming facility within PostgreSQL.
> The applications are reasonably obvious - detecting changes to tables,
> validating data migrations, unit testing etc.  A possible algorithm is as
> follows:
>
> 1. For each row of the table, take the binary representations of the
> values and serialise them to CSV.
> 2. Calculate the MD5 sum of each CSV-serialised row.
> 3. XOR the row MD5 sums together.
> 4. CSV-serialise and MD5 a list of representations (of some sort) of the
> types of the table's columns and XOR it with the rest.
> 5. Output the result as the table's checksum.
>
> Advantages of this approach:
>
> 1. Easily implemented using SPI.
> 2. Since XOR is commutative and associative, order of ingestion of rows
> doesn't matter; therefore, unlike some other table checksumming methods,
> this doesn't need an expensive ORDER BY *.  So, this should be pretty much
> as fast as a SELECT * FROM, which is probably as fast as a table checksum
> can be.
> 3. Using a cursor in SPI, rows can be ingested a few at a time.  So memory
> footprint is low even for large tables.
> 4. Output has a convenient fixed size of 128 bits.
>
> Questions:
>
> 1. Should this be a contrib module which provides a function, or should it
> be a built-in piece of functionality?
> 2. Is MD5 too heavyweight for this?  Would using a non-cryptographic
> checksum be worth the speed boost?
> 3. Is there a risk of different architectures/versions returning different
> checksums for tables which could be considered identical?  If so, is this
> worth worrying about?
>

Hmm - Do you really think we need an extension for something that can be
done using query as simple as:

select md5(string_agg(md5(c::text), '' order by md5(c::text))) from
pg_class c;

(of course you can do it on any table, not only pg_class).

If you want to use the xor idea (which make sense), all you need is to
write xor aggregate.

depesz

Reply via email to