On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote: > I've been playing around with the idea of an aggregate that computes > the sum of the md5 hashes of each of its inputs, which I've called > md5_total() for now, although I'm not particularly wedded to that > name. Comparing it with md5_agg() on a 100M row table (see attached > test script) produces interesting results: > > SELECT md5_agg(foo.*::text) > FROM (SELECT * FROM foo ORDER BY id) foo; > > 50bc42127fb9b028c9708248f835ed8f > > Time: 92960.021 ms > > SELECT md5_total(foo.*::text) FROM foo; > > 02faea7fafee4d253fc94cfae031afc43c03479c > > Time: 96190.343 ms > > Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its > result is longer) but it seems like it would be very useful for > quickly comparing data in SQL, since its value is not dependent on the > row-order making it easier to use and better performing if there is no > usable index for ordering.
I took a look at this patch. First, as Peter suggested upthread, it should have been two patches: one to optimize calculateDigestFromBuffer() and callees, another atop the first adding new SQL functions and adjusting the infrastructure to meet their needs. > + <primary>md5_total</primary> > + </indexterm> > + <function>md5_total(<replaceable > class="parameter">expression</replaceable>)</function> > + </entry> > + <entry><type>text</type> or <type>bytea</type></entry> > + <entry><type>text</type></entry> > + <entry> > + sum of the MD5 hash values of the input values, independent of their > + order This is not specific enough. We would need to provide either an algorithm specification or a literature reference. However, that just leads to the problem that we should not put a literature-unrecognized cryptographic algorithm into core PostgreSQL. I realize that the use case inspiring this patch wasn't concerned with cryptographic properties, but the association with md5 immediately makes it a cryptography offering. md5_agg() is well-defined and not cryptographically novel, and your use case is credible. However, not every useful-sometimes function belongs in core; we mostly stick to functions with ubiquitous value and functions that would be onerous to implement externally. (We do carry legacy stuff that wouldn't make the cut today.) In my estimation, md5_agg() does not make that cut. The variety of intuitive md5_agg() definitions attested upthread doesn't bode well for its broad applicability. It could just as well live in an extension published on PGXN. Mine is just one opinion, though; I won't be horrified if the community wants an md5_agg() in core. > *** a/src/backend/libpq/md5.c > --- b/src/backend/libpq/md5.c > *************** > *** 1,14 **** > /* > * md5.c > * > ! * Implements the MD5 Message-Digest Algorithm as specified in > ! * RFC 1321. This implementation is a simple one, in that it > ! * needs every input byte to be buffered before doing any > ! * calculations. I do not expect this file to be used for > ! * general purpose MD5'ing of large amounts of data, only for > ! * generating hashed passwords from limited input. In other words, performance wasn't a design criterion. I wonder if we would be wasting our time with a narrow performance change like removing the data copy. What's your opinion of copying pgcrypto's md5 implementation, which already avoids the data copy? Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers