On Thursday, 24 October 2013 09:12:21 UTC+8, puzzler wrote:

> Even though vectors aren't as prone to collision between similar-looking
> vectors, I think you are right that the smallness of the numbers is a
> problem:
>
> Consider the following:
>
> => (count (set (for [x (range 200) y (range 200)] (hash [x y]))))
> 6369
>
> This means that out of 40,000 common ordered pairs, there are only 6369
> distinct hash values. That's not good.
>
> So yes, upon further thought, I think sequence hashing should be improved
> as well.
>
> I want to emphasize that I don't view this as merely a "theoretical"
> problem. I have noticed for over a year now, when profiling my Clojure
> code, that a surprising amount of time was spent on dealing with hash
> collisions. I chalked it up as the natural behavior of hash tables which
> will occasionally have collisions, and just learned to live with it, but
> the more I play around with the existing hash function on data similar to
> what I use (with small numbers in slightly varying structures), the more
> I'm convinced that Clojure's weak hashing strategy explains the performance
> issues I've had. Would love to see the Clojure community tackle this
> head-on; I think there's a lot of performance to be gained for many real
> apps by doing this.
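The collision count you measured falls straight out of the list hashing formula. Here's a minimal REPL sketch of my own (list-hash is just a throwaway name; it assumes the hash = 31*hash + element-hash recurrence from the java.util.List contract, which is what Clojure's vectors effectively do for small integer elements):

  ;; java.util.List-style hash: start at 1, then hash = 31*hash + (hash element).
  ;; Assumes (hash n) = n for small non-negative longs, which you can check at the REPL.
  (defn list-hash [coll]
    (reduce (fn [h e] (+ (* 31 h) (hash e))) 1 coll))

  (list-hash [0 31]) ;=> 992
  (list-hash [1 0])  ;=> 992  -- different pairs, same hash

For a pair [x y] the formula works out to 961 + 31x + y, so with x and y both under 200 the result can only land between 961 and 7329 - a range of exactly 6369 possible values for 40,000 pairs, which matches the count in your experiment.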
You can't change the hashCode of lists / vectors without breaking the contract for java.util.List (which PersistentList and PersistentVector both implement):

http://docs.oracle.com/javase/7/docs/api/java/util/List.html#hashCode()

I don't think breaking this contract would be a good idea. The java.util.List hashCode isn't actually too bad: it's a decent compromise between an efficient implementation and reasonably effective hashing for most real-world use cases. Sure, you can construct artificial examples that produce some collisions, but that's true of any hash function. Note, though, that the maximum number of collisions is still pretty low:

  (apply max (vals (frequencies (for [x (range 200) y (range 200)] (hash [x y])))))
  => 7

That's the number you really care about, since it is the number of objects landing in the same hash bucket that drives the pathological O(n^2) behaviour.
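If you want the whole distribution of bucket sizes rather than just the worst one, a small variation of the same snippet shows it (I haven't listed the output; it's a map from bucket size to how many distinct hash values are shared by that many pairs):

  ;; How many hash values are shared by 1, 2, 3, ... of the 40,000 pairs?
  (frequencies
    (vals (frequencies (for [x (range 200) y (range 200)]
                         (hash [x y])))))

That makes it easy to see whether the collisions are spread thinly across many buckets or concentrated in a few, which is what actually determines lookup cost.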