Re: [HACKERS] MD5 aggregate

Noah Misch Wed, 26 Jun 2013 11:33:11 -0700

On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
> I've been playing around with the idea of an aggregate that computes
> the sum of the md5 hashes of each of its inputs, which I've called
> md5_total() for now, although I'm not particularly wedded to that
> name. Comparing it with md5_agg() on a 100M row table (see attached
> test script) produces interesting results:
> 
> SELECT md5_agg(foo.*::text)
>   FROM (SELECT * FROM foo ORDER BY id) foo;
> 
>  50bc42127fb9b028c9708248f835ed8f
> 
> Time: 92960.021 ms
> 
> SELECT md5_total(foo.*::text) FROM foo;
> 
>  02faea7fafee4d253fc94cfae031afc43c03479c
> 
> Time: 96190.343 ms
> 
> Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
> result is longer) but it seems like it would be very useful for
> quickly comparing data in SQL, since its value is not dependent on the
> row-order making it easier to use and better performing if there is no
> usable index for ordering.


I took a look at this patch.  First, as Peter suggested upthread, it should
have been two patches: one to optimize calculateDigestFromBuffer() and
callees, another atop the first adding new SQL functions and adjusting the
infrastructure to meet their needs.

> +         <primary>md5_total</primary>
> +        </indexterm>
> +        <function>md5_total(<replaceable 
> class="parameter">expression</replaceable>)</function>
> +       </entry>
> +       <entry><type>text</type> or <type>bytea</type></entry>
> +       <entry><type>text</type></entry>
> +       <entry>
> +        sum of the MD5 hash values of the input values, independent of their
> +        order

This is not specific enough.  We would need to provide either an algorithm
specification or a literature reference.  However, that just leads to the
problem that we should not put a literature-unrecognized cryptographic
algorithm into core PostgreSQL.  I realize that the use case inspiring this
patch wasn't concerned with cryptographic properties, but the association with
md5 immediately makes it a cryptography offering.

md5_agg() is well-defined and not cryptographically novel, and your use case
is credible.  However, not every useful-sometimes function belongs in core; we
mostly stick to functions with ubiquitous value and functions that would be
onerous to implement externally.  (We do carry legacy stuff that wouldn't make
the cut today.)  In my estimation, md5_agg() does not make that cut.  The
variety of intuitive md5_agg() definitions attested upthread doesn't bode well
for its broad applicability.  It could just as well live in an extension
published on PGXN.  Mine is just one opinion, though; I won't be horrified if
the community wants an md5_agg() in core.

> *** a/src/backend/libpq/md5.c
> --- b/src/backend/libpq/md5.c
> ***************
> *** 1,14 ****
>   /*
>    *  md5.c
>    *
> !  *  Implements      the  MD5 Message-Digest Algorithm as specified in
> !  *  RFC  1321.      This  implementation  is a simple one, in that it
> !  *  needs  every  input  byte  to  be  buffered  before doing any
> !  *  calculations.  I  do  not  expect  this  file  to be used for
> !  *  general  purpose  MD5'ing  of large amounts of data, only for
> !  *  generating hashed passwords from limited input.

In other words, performance wasn't a design criterion.  I wonder if we would
be wasting our time with a narrow performance change like removing the data
copy.  What's your opinion of copying pgcrypto's md5 implementation, which
already avoids the data copy?

Thanks,
nm

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] MD5 aggregate

Reply via email to