On Wed, Nov 03, 2010 at 10:24:16AM +0100, Nicolas Barbier wrote: > 2010/11/2 Kenneth Marshall <k...@rice.edu>: > > > Given that our hash implimentation mixes the input data well (It does. > > I tested it.) then a simple rotate-and-xor method is all that should > > be needed to maintain all of the needed information. The original > > hash function has done the heavy lifting in this case. > > Even with the perfect hash function for the elements, certain > combinations of elements could still lead to massive collisions. E.g., > if repeated values are typical in the input data we are talking about, > then the rotate-and-xor method would still lead to collisions between > any array of the same values of certain lengths, regardless of the > value. In Tom's implementation, as he mentioned before, those > problematical lengths would be multiples of 32 (e.g., an array of 32 > 1s would collide with an array of 32 2s would collide with an array of > 32 3s, etc). > > Nicolas >
True. I just took another look at our defined hash functions and it looks like we can make a simple variant of hash_uint32() that we can use as a stream checksum. The only thing missing is that ability to pass in the current 32-bit hash value as a starting seed to add the next 32-bit value. Something like this should work: Datum hash_uint32(uint32 k, uint32 initval) { register uint32 a, b, c; a = b = c = 0x9e3779b9 + (uint32) sizeof(uint32) + 3923095 + initval; a += k; final(a, b, c); /* report the result */ return UInt32GetDatum(c); } Then if you pass in the current value as the initval, it should mix well each additional 32-bit hash value. Regards, Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers