Re: New and old producers partition messages differently

Wes Chow Sun, 26 Apr 2015 20:53:44 -0700

Along these lines too, is the function customizable? I could see how mmh3 (or 2) would be generally sufficient, however in some cases you may want something that's a bit more cryptographically secure so as to avoid attacks.

(Though I suppose the programmer could first crypto-hash the key, and then pass it through mmh.)
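
A minimal sketch of that pre-hashing idea, assuming SHA-256 as the cryptographic hash. The class and method names here are hypothetical, not part of any Kafka API. The digest would be handed to the producer as the record key, and the partitioner would then murmur-hash it as usual.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PreHashedKey {
    // Returns the SHA-256 digest of the raw key. The digest, rather than the
    // raw key, is what would be passed to the producer as the record key.
    static byte[] preHash(byte[] rawKey) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(rawKey);
        } catch (NoSuchAlgorithmException e) {
            // Every compliant Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] digest = preHash("user-42".getBytes(StandardCharsets.UTF_8));
        System.out.println("digest length in bytes: " + digest.length);
    }
}
```

Note that an unkeyed digest can still be computed by anyone who knows the scheme, so if the goal is to stop an attacker from steering keys to one partition, a keyed MAC with a secret is probably the safer variant of the same idea.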

Wes

Evan Huus <mailto:evan.h...@shopify.com>
April 26, 2015 11:51 AM
Related to this topic: why the choice of murmur2 over murmur3? I'm not
super-familiar with the differences between the two, but I'd assume murmur3
would be faster or have a more even distribution or something.

Evan

P.S. Also, there appear to be many murmur3 implementations for other
languages, whereas murmur2 is much less common.


Jay Kreps <mailto:jay.kr...@gmail.com>
April 26, 2015 10:57 AM
This was actually intentional.

The problem with relying on hashCode is that
(1) it is often a very bad hash function,
(2) it is not guaranteed to be consistent from run to run (i.e. if you
restart the jvm the value of hashing the same key can change!),
(3) it is not available outside the jvm so non-java producers can't use the
same function.
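
To illustrate point (2), here is a small sketch (the class and method names are hypothetical). String.hashCode() is specified in its Javadoc, but any type that does not override hashCode(), including byte[], falls back to the identity-based Object.hashCode(), which is not tied to the contents and may change between JVM runs.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class HashCodeStability {
    // Compares two arrays using the content-based Arrays.hashCode, which
    // depends only on the bytes and so is equal for equal contents.
    static boolean contentHashesMatch(byte[] a, byte[] b) {
        return Arrays.hashCode(a) == Arrays.hashCode(b);
    }

    public static void main(String[] args) {
        byte[] a = "key".getBytes(StandardCharsets.UTF_8);
        byte[] b = "key".getBytes(StandardCharsets.UTF_8);

        // byte[] inherits Object.hashCode(): these two values are unrelated
        // to the contents and typically differ from each other and across runs.
        System.out.println(a.hashCode() + " vs " + b.hashCode());

        // Content-based hashing gives the same answer for the same bytes.
        System.out.println(contentHashesMatch(a, b));
    }
}
```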

In general at the moment different producers don't use the same hash code
so I think this is not quite as bad as it sounds. Though it would be good
to standardize things.

I think the most obvious thing we could do here would be to do a much
better job of advertising this in the docs, though, so people don't get
bitten by it.

-Jay


James Cheng <mailto:jch...@tivo.com>
April 24, 2015 8:48 PM
Hi,
I was playing with the new producer in 0.8.2.1 using partition keys ("semantic partitioning" I believe is the phrase?). I noticed that the default partitioner in 0.8.2.1 does not partition items the same way as the old 0.8.1.1 default partitioner was doing. For a test item, the old producer was sending it to partition 0, whereas the new producer was sending it to partition 4.
Digging in the code, it appears that the partitioning logic is different between the old and new producers. Both of them hash the key, but they use different hashing algorithms.
Old partitioner:
./core/src/main/scala/kafka/producer/DefaultPartitioner.scala:

    def partition(key: Any, numPartitions: Int): Int = {
      Utils.abs(key.hashCode) % numPartitions
    }

New partitioner:
./clients/src/main/java/org/apache/kafka/clients/producer/internals/Partitioner.java:

    } else {
        // hash the key to choose a partition
        return Utils.abs(Utils.murmur2(record.key())) % numPartitions;
    }
Where murmur2 is a custom hashing algorithm. (I'm assuming that murmur2 isn't the same logic as hashCode, especially since hashCode is overridable).
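
The two schemes quoted above can be put side by side in a rough, self-contained sketch. Several things here are assumptions rather than the actual client code: the class and method names are hypothetical, nonNegative stands in for Utils.abs (assumed to map Integer.MIN_VALUE to 0), and the hash is a plain 32-bit MurmurHash2 whose seed may not match the one the Kafka client uses.

```java
import java.nio.charset.StandardCharsets;

public class PartitionerSketch {
    // Stand-in for Utils.abs (assumed behaviour). Math.abs alone is unsafe
    // because Math.abs(Integer.MIN_VALUE) is still negative.
    static int nonNegative(int n) {
        return (n == Integer.MIN_VALUE) ? 0 : Math.abs(n);
    }

    // 32-bit MurmurHash2 over a byte array. The seed is an assumption.
    static int murmur2(byte[] data) {
        final int seed = 0x9747b28c;
        final int m = 0x5bd1e995;
        final int r = 24;
        int length = data.length;
        int h = seed ^ length;
        // Mix four bytes at a time, little-endian.
        for (int i = 0; i < length / 4; i++) {
            int i4 = i * 4;
            int k = (data[i4] & 0xff)
                  | ((data[i4 + 1] & 0xff) << 8)
                  | ((data[i4 + 2] & 0xff) << 16)
                  | ((data[i4 + 3] & 0xff) << 24);
            k *= m;
            k ^= k >>> r;
            k *= m;
            h *= m;
            h ^= k;
        }
        // Mix in the remaining 1 to 3 bytes (intentional fall-through).
        int tail = length & ~3;
        switch (length % 4) {
            case 3: h ^= (data[tail + 2] & 0xff) << 16;
            case 2: h ^= (data[tail + 1] & 0xff) << 8;
            case 1: h ^= (data[tail] & 0xff);
                    h *= m;
        }
        h ^= h >>> 13;
        h *= m;
        h ^= h >>> 15;
        return h;
    }

    // Old-producer style: depends on the key object's hashCode().
    static int hashCodePartition(Object key, int numPartitions) {
        return nonNegative(key.hashCode()) % numPartitions;
    }

    // New-producer style: depends only on the serialized key bytes.
    static int murmurPartition(byte[] keyBytes, int numPartitions) {
        return nonNegative(murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        String key = "test-item";
        int numPartitions = 8;
        System.out.println("hashCode partition: " + hashCodePartition(key, numPartitions));
        System.out.println("murmur2 partition:  "
                + murmurPartition(key.getBytes(StandardCharsets.UTF_8), numPartitions));
    }
}
```

Since the two functions hash different inputs with different algorithms, there is no reason to expect them to agree on a partition for a given key.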
Was it intentional that the hashing algorithm would change between the old and new producer? If so, was this documented? I don't know if anyone was relying on the old default partitioner, as opposed to going round-robin or using their own custom partitioner. Do you expect it to change in the future? I'm guessing that one of the main reasons to have a custom hashing algorithm is so that you are in full control of the partitioning and can keep it stable (as opposed to being reliant on hashCode()).
Thanks,
-James
