On 12/28/16 12:27 PM, [email protected] wrote:
> On Wed, Dec 28, 2016 at 11:48 AM, Ned Batchelder
> <[email protected]> wrote:
>
> > You can write a simple function to use hash iteratively to hash
> > the entire stream in constant space and linear time:
> >
> >     def hash_stream(them):
> >         val = 0
> >         for it in them:
> >             val = hash((val, it))
> >         return val
> >
> > Although this creates N 2-tuples, they come and go, so the memory
> > use won't grow. Adjust the code as needed to achieve
> > canonicalization before iterating.
> >
> > Or maybe I am misunderstanding the requirements?
>
> This is better than solutions like
> http://stackoverflow.com/a/27952689/161642
> in the sense that it's a little higher level (no bit shifting or
> magic numbers).
>
> But it's not clear that it's any better in the sense that you're
> still rolling your own incremental hash algorithm out of a
> lower-level primitive that doesn't document support for this, and
> therefore taking responsibility yourself for how well it distributes
> values into buckets.
>
> Are you confident this results in good hash performance? Is this
> better than a solution built on top of a hash function with an
> explicit API for calculating a hash incrementally, such as the crc32
> example I included? (And again, this would ideally be a
> sys.hash_info.hash_bits-bit algorithm.)
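(A brief usage sketch of hash_stream as quoted above: it assumes the
canonicalization step is handled by iterating an unordered input in
sorted order; the sample data and the use of sorted() are illustrative
assumptions, not part of the original.)

    # Equal but differently-ordered collections hash equally once
    # iteration order is canonicalized, e.g. by sorting.
    items = {3, 1, 2}
    assert hash_stream(sorted(items)) == hash_stream(sorted({2, 1, 3}))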
I don't have the theoretical background to defend this function. But
it seems to me that if we believe that hash((int, thing)) distributes
well, then how could this function not distribute well?

--Ned.
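(For comparison, a minimal sketch of the incremental-API route raised
in the quoted message, built on zlib.crc32, whose optional second
argument is a documented running checksum. Two assumptions to flag:
elements are encoded with repr(), which only stands in for a real
canonical byte encoding, and CRC-32 produces 32 bits rather than
sys.hash_info.hash_bits.)

    import zlib

    def crc_stream(them):
        # zlib.crc32 accepts a starting value, so the checksum can
        # be folded in one element at a time in constant space.
        val = 0
        for it in them:
            # repr() is an illustrative stand-in for a canonical
            # byte encoding of each element.
            val = zlib.crc32(repr(it).encode("utf-8"), val)
        return val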
