We are working on support for custom partitioners, so everything will be customizable :)

On Sun, Apr 26, 2015 at 8:52 PM, Wes Chow <w...@chartbeat.com> wrote:

>
> Along these lines too, is the function customizable? I could see how mmh3
> (or 2) would be generally sufficient, however in some cases you may want
> something that's a bit more cryptographically secure so as to avoid attacks.
>
> (Though I suppose the programmer could first crypto-hash the key, and then
> pass it through mmh.)
>
> Wes
>
> Evan Huus <evan.h...@shopify.com>
> April 26, 2015 11:51 AM
> Related to this topic: why the choice of murmur2 over murmur3? I'm not
> super-familiar with the differences between the two, but I'd assume murmur3
> would be faster or have a more even distribution or something.
>
> Evan
>
> P.S. Also, there appear to be many murmur3 implementations for other
> languages, whereas murmur2 is much less common.
>
>
> Jay Kreps <jay.kr...@gmail.com>
> April 26, 2015 10:57 AM
> This was actually intentional.
>
> The problem with relying on hashCode is that
> (1) it is often a very bad hash function,
> (2) it is not guaranteed to be consistent from run to run (i.e. if you
> restart the jvm the value of hashing the same key can change!),
> (3) it is not available outside the jvm so non-java producers can't use the
> same function.
>
> In general at the moment different producers don't use the same hash code
> so I think this is not quite as bad as it sounds. Though it would be good
> to standardize things.
>
> I think the most obvious thing we could do here would be to do a much
> better job of advertising this in the docs, though, so people don't get
> bitten by it.
>
> -Jay
>
>
> James Cheng <jch...@tivo.com>
> April 24, 2015 8:48 PM
> Hi,
>
> I was playing with the new producer in 0.8.2.1 using partition keys
> ("semantic partitioning" I believe is the phrase?). I noticed that the
> default partitioner in 0.8.2.1 does not partition items the same way as the
> old 0.8.1.1 default partitioner was doing. For a test item, the old
> producer was sending it to partition 0, whereas the new producer was
> sending it to partition 4.
>
> Digging in the code, it appears that the partitioning logic is different
> between the old and new producers. Both of them hash the key, but they use
> different hashing algorithms.
>
> Old partitioner:
> ./core/src/main/scala/kafka/producer/DefaultPartitioner.scala:
>
>   def partition(key: Any, numPartitions: Int): Int = {
>     Utils.abs(key.hashCode) % numPartitions
>   }
>
> New partitioner:
>
> ./clients/src/main/java/org/apache/kafka/clients/producer/internals/Partitioner.java:
>
>   } else {
>     // hash the key to choose a partition
>     return Utils.abs(Utils.murmur2(record.key())) % numPartitions;
>   }
>
> Where murmur2 is a custom hashing algorithm. (I'm assuming that murmur2
> isn't the same logic as hashCode, especially since hashCode is
> overrideable).
>
> Was it intentional that the hashing algorithm would change between the old
> and new producer? If so, was this documented? I don't know if anyone was
> relying on the old default partitioner, as opposed to going round-robin or
> using their own custom partitioner. Do you expect it to change in the
> future? I'm guessing that one of the main reasons to have a custom hashing
> algorithm is so that you are full control of the partitioning and can keep
> it stable (as opposed to being reliant on hashCode()).
>
> Thanks,
> -James
>
>
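
For anyone who wants to experiment with the hash-then-modulo shape that both quoted partitioners share, here is a rough, self-contained sketch with the hash function passed in as a parameter. It is not Kafka code: the class `PartitionSketch` and its methods are hypothetical names, CRC32 is only a stand-in for a stable byte-based hash (chosen because it ships with the JDK; it is not murmur2), and clearing the sign bit in `toPositive` is an assumption about what `Utils.abs` is meant to achieve.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.function.ToIntFunction;
import java.util.zip.CRC32;

// Hypothetical sketch, not Kafka code: all names here are illustrative only.
final class PartitionSketch {

    // Map any int hash to a non-negative value by clearing the sign bit.
    // Math.abs is avoided because Math.abs(Integer.MIN_VALUE) is still negative.
    static int toPositive(int hash) {
        return hash & 0x7fffffff;
    }

    // Hash-then-modulo, with the hash function supplied by the caller.
    static int partition(byte[] key, int numPartitions, ToIntFunction<byte[]> hash) {
        return toPositive(hash.applyAsInt(key)) % numPartitions;
    }

    // A hash over the key bytes that does not depend on the JVM run.
    // Assumption: CRC32 is used purely as a convenient stand-in for murmur2.
    static int crc32(byte[] key) {
        CRC32 crc = new CRC32();
        crc.update(key, 0, key.length);
        return (int) crc.getValue();
    }

    public static void main(String[] args) {
        byte[] key = "user-42".getBytes(StandardCharsets.UTF_8);
        int numPartitions = 8;

        // Two different hash functions over the same key bytes; nothing
        // guarantees that they agree on the partition.
        int viaCrc32 = partition(key, numPartitions, PartitionSketch::crc32);
        int viaArraysHash = partition(key, numPartitions, Arrays::hashCode);

        System.out.println("crc32-based partition:           " + viaCrc32);
        System.out.println("Arrays.hashCode-based partition: " + viaArraysHash);
    }
}
```

Swapping the function passed to `partition` is all it takes to move a key to a different partition, which is the effect James saw between the old and new producers. Producers only agree on a key's placement if they share both the hash function and the byte encoding of the key.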