On 21 June 2013 21:04, David Fetter <da...@fetter.org> wrote: > On Fri, Jun 21, 2013 at 10:48:35AM -0700, David Fetter wrote: >> On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote: >> > On 15 June 2013 10:22, Dean Rasheed <dean.a.rash...@gmail.com> wrote: >> > > There seem to be 2 separate directions that this could go, which >> > > really meet different requirements: >> > > >> > > 1). Produce an unordered sum for SQL to compare 2 tables regardless of >> > > the order in which they are scanned. A possible approach to this might >> > > be something like an aggregate >> > > >> > > md5_total(text/bytea) returns text >> > > >> > > that returns the sum of the md5 values of each input value, treating >> > > each md5 value as an unsigned 128-bit integer, and then producing the >> > > hexadecimal representation of the final sum. This should out-perform a >> > > solution based on numeric addition, and in typical cases, the result >> > > wouldn't be much longer than a regular md5 sum, and so would be easy >> > > to eyeball for differences. >> > > >> > >> > I've been playing around with the idea of an aggregate that computes >> > the sum of the md5 hashes of each of its inputs, which I've called >> > md5_total() for now, although I'm not particularly wedded to that >> > name. Comparing it with md5_agg() on a 100M row table (see attached >> > test script) produces interesting results: >> > >> > SELECT md5_agg(foo.*::text) >> > FROM (SELECT * FROM foo ORDER BY id) foo; >> > >> > 50bc42127fb9b028c9708248f835ed8f >> > >> > Time: 92960.021 ms >> > >> > SELECT md5_total(foo.*::text) FROM foo; >> > >> > 02faea7fafee4d253fc94cfae031afc43c03479c >> > >> > Time: 96190.343 ms >> > >> > Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its >> > result is longer) but it seems like it would be very useful for >> > quickly comparing data in SQL, since its value is not dependent on the >> > row-order making it easier to use and better performing if there is no >> > usable index for ordering. >> > >> > Note, however, that if there is an index that can be used for >> > ordering, the performance is not necessarily better than md5_agg(), as >> > this example shows. There is a small additional overhead per row for >> > initialising the MD5 sums, and adding the results to the total, but I >> > think the biggest factor is that md5_total() is processing more data. >> > The reason is that MD5 works on 64-byte blocks, so the total amount of >> > data going through the core MD5 algorithm is each row's size is >> > rounded up to a multiple of 64. In this simple case it ends up >> > processing around 1.5 times as much data: >> > >> > SELECT sum(length(foo.*::text)) AS md5_agg, >> > sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo; >> > >> > md5_agg | md5_total >> > ------------+------------- >> > 8103815438 | 12799909248 >> > >> > although of course that overhead won't be as large on wider tables, >> > and even in this case the overall performance is still on a par with >> > md5_agg(). >> > >> > ISTM that both aggregates are potentially useful in different >> > situations. I would probably typically use md5_total() because of its >> > simplicity/order-independence and consistent performance, but >> > md5_agg() might also be useful when comparing with external data. >> > >> > Regards, >> > Dean > >> Performance review (skills needed: ability to time performance) >> >> Does the patch slow down simple tests? >> >> Yes. For an MD5 checksum of some random data, here are >> results from master: >> >> shackle@postgres:5493=# WITH t1 AS (SELECT >> string_agg(chr(floor(255*random()+1)::int),'') AS a FROM >> generate_series(1,10000)), >> postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN >> generate_series(1,10000)) >> postgres-# select md5(a::text) FROM t2; >> shackle@postgres:5493=# \timing >> Timing is on. >> shackle@postgres:5493=# \g >> Time: 955.393 ms >> shackle@postgres:5493=# \g >> Time: 950.909 ms >> shackle@postgres:5493=# \g >> Time: 947.955 ms >> shackle@postgres:5493=# \g >> Time: 946.193 ms >> shackle@postgres:5493=# \g >> Time: 950.591 ms >> shackle@postgres:5493=# \g >> Time: 952.346 ms >> shackle@postgres:5493=# \g >> Time: 948.623 ms >> shackle@postgres:5493=# \g >> Time: 939.819 ms >> >> and here from master + the patch: >> >> WITH t1 AS (SELECT >> string_agg(chr(floor(255*random()+1)::int),'') AS a FROM >> generate_series(1,10000)), >> t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000)) >> select md5(a::text) FROM t2; >> Time: 1129.178 ms >> shackle@postgres:5494=# \g >> Time: 1134.013 ms >> shackle@postgres:5494=# \g >> Time: 1124.387 ms >> shackle@postgres:5494=# \g >> Time: 1119.733 ms >> shackle@postgres:5494=# \g >> Time: 1104.906 ms >> shackle@postgres:5494=# \g >> Time: 1137.055 ms >> shackle@postgres:5494=# \g >> Time: 1128.977 ms >> shackle@postgres:5494=# \g >> Time: 1143.619 ms >> >> Avg, StddevPopulation without patch: 948.97 ms, 4.353 ms >> Avg, StddevPopulation with patch: 1127.73 ms, 11.06 ms > > The above was with -O0. Please find below with vanilla ./configure (-O2): > > Master: > shackle@postgres:5493=# WITH t1 AS (SELECT > string_agg(chr(floor(255*random()+1)::int),'') AS a FROM > generate_series(1,10000)), > t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000)) > select md5(a::text) FROM t2; > Time: 469.101 ms > shackle@postgres:5493=# \g > Time: 464.122 ms > shackle@postgres:5493=# \g > Time: 461.411 ms > shackle@postgres:5493=# \g > Time: 458.222 ms > shackle@postgres:5493=# \g > Time: 463.616 ms > shackle@postgres:5493=# \g > Time: 463.983 ms > shackle@postgres:5493=# \g > Time: 453.770 ms > shackle@postgres:5493=# \g > Time: 456.729 ms > > MD5: > WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a > FROM generate_series(1,10000)), > t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000)) > select md5(a::text) FROM t2; > Time: 436.862 ms > shackle@postgres:5494=# \t > Showing only tuples. > shackle@postgres:5494=# \t > Tuples only is off. > shackle@postgres:5494=# \g > Time: 438.686 ms > shackle@postgres:5494=# \g > Time: 443.789 ms > shackle@postgres:5494=# \g > Time: 452.522 ms > shackle@postgres:5494=# \g > Time: 447.824 ms > shackle@postgres:5494=# \g > Time: 448.547 ms > shackle@postgres:5494=# \g > Time: 446.901 ms > shackle@postgres:5494=# \g > Time: 448.537 ms > > Average, stddev population for master: 461.36925, 4.5883631545 > Average, stddev population for md5: 445.4585, 4.9892286729 > > so slightly faster on MD5, as expected. > > Cheers, > David.
Thanks for the review and testing! Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers