[
https://issues.apache.org/jira/browse/KAFKA-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562077#comment-14562077
]
Jay Kreps commented on KAFKA-2223:
----------------------------------
This is a good point. There are a couple of problems with using Object.hashCode
as the scala producer does. One is that it is Java-specific, so a producer
written in another language can't reproduce the same partition function. The
other is that many hashCode implementations don't provide a very good
distribution.
For this reason, in the new producer released in 0.8.2 we use a murmur hash of
the serialized key bytes, which gives a much better distribution.
I think we should probably not change the hash function in the scala producer,
as that would redistribute existing keys across partitions.
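For reference, here is a minimal self-contained sketch of that scheme: MurmurHash2
over the serialized key bytes, with the sign bit masked off before taking the
modulo. The class name and the main() driver are illustrative only; the constants
are the standard MurmurHash2 ones, essentially what the new producer's utility
code uses.

```java
import java.nio.ByteBuffer;

public class Murmur2PartitionSketch {
    // MurmurHash2 (32-bit) over a byte array, standard constants.
    static int murmur2(final byte[] data) {
        final int length = data.length;
        final int seed = 0x9747b28c;
        final int m = 0x5bd1e995;
        final int r = 24;
        int h = seed ^ length;
        // Mix the input 4 bytes at a time.
        final int length4 = length / 4;
        for (int i = 0; i < length4; i++) {
            final int i4 = i * 4;
            int k = (data[i4] & 0xff) + ((data[i4 + 1] & 0xff) << 8)
                  + ((data[i4 + 2] & 0xff) << 16) + ((data[i4 + 3] & 0xff) << 24);
            k *= m;
            k ^= k >>> r;
            k *= m;
            h *= m;
            h ^= k;
        }
        // Handle the last few bytes (intentional switch fall-through).
        switch (length % 4) {
            case 3: h ^= (data[(length & ~3) + 2] & 0xff) << 16;
            case 2: h ^= (data[(length & ~3) + 1] & 0xff) << 8;
            case 1: h ^= data[length & ~3] & 0xff;
                    h *= m;
        }
        // Final avalanche of the last few bits.
        h ^= h >>> 13;
        h *= m;
        h ^= h >>> 15;
        return h;
    }

    // Mask off the sign bit so the modulo result is always non-negative.
    static int partition(byte[] keyBytes, int numPartitions) {
        return (murmur2(keyBytes) & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // The pathological case from the report below: 8-byte big-endian
        // longs, one per second over 24 hours, spread over 50 partitions.
        final int numPartitions = 50;
        boolean[] hit = new boolean[numPartitions];
        for (long t = 0; t < 86_400_000L; t += 1000) {
            byte[] key = ByteBuffer.allocate(8).putLong(t).array();
            hit[partition(key, numPartitions)] = true;
        }
        int partitionsWithData = 0;
        for (boolean b : hit) if (b) partitionsWithData++;
        System.out.println("partitions with data: " + partitionsWithData + "/" + numPartitions);
    }
}
```

With this hash, the timestamp keys that left 34 of 50 partitions empty under
hashCode-based partitioning land on every partition.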
> Improve distribution of data when using hash-based partitioning
> ---------------------------------------------------------------
>
> Key: KAFKA-2223
> URL: https://issues.apache.org/jira/browse/KAFKA-2223
> Project: Kafka
> Issue Type: Improvement
> Reporter: Gabriel Reid
> Attachments: KAFKA-2223.patch
>
>
> Both the DefaultPartitioner and the ByteArrayPartitioner derive the partition
> from the hash code of the key modulo the number of partitions, along the
> lines of {code}partition = key.hashCode() % numPartitions{code} (conversion
> to an absolute value is omitted here).
> This approach is entirely dependent on the _lower bits_ of the hash code
> being uniformly distributed in order to get good distribution of records over
> multiple partitions. If the lower bits of the key hash code are not uniformly
> distributed, the key space will not be uniformly distributed over the
> partitions.
> It can be surprisingly easy to get a very poor distribution. As a simple
> example, if the keys are integer values and are all divisible by 2, then only
> half of the partitions will receive data (as the hash code of an integer is
> the integer value itself).
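A minimal sketch reproducing this (the partition count of 8 and the key range
are arbitrary choices for illustration; the partitioning logic mirrors the
hashCode-modulo scheme above):

```java
public class EvenKeyDemo {
    // hashCode-based partitioning, as in the hashCode-modulo scheme above.
    // An Integer's hashCode is the integer value itself; the sign bit is
    // masked off so the modulo result is non-negative.
    static int partition(Object key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        final int numPartitions = 8; // arbitrary even partition count
        boolean[] hit = new boolean[numPartitions];
        for (int k = 0; k < 2000; k += 2) // all-even integer keys
            hit[partition(k, numPartitions)] = true;
        int used = 0;
        for (boolean b : hit) if (b) used++;
        System.out.println("partitions used: " + used + " of " + numPartitions);
        // prints "partitions used: 4 of 8"
    }
}
```

Only the four even-numbered partitions ever receive data: an even key's hash
code modulo an even partition count is always even.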
> This can even be a problem in situations where you would really not expect
> it. For example, taking the 8-byte big-endian byte-array representation of
> longs for each timestamp of each second over a period of 24 hours (at
> millisecond granularity) and partitioning it over 50 partitions results in 34
> of the 50 partitions not getting any data at all.
> The easiest way to resolve this is to have a custom HashPartitioner that
> applies a supplementary hash function to the return value of the key's
> hashCode method. This same approach is taken in java.util.HashMap for the
> exact same reason.
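As a sketch of that fix, assuming the supplemental hash used by
java.util.HashMap in Java 7 (class and method names here are illustrative):

```java
public class SupplementalHashDemo {
    // Supplemental hash as in java.util.HashMap (Java 7): folds higher-order
    // bits down into the low bits that the modulo actually uses.
    static int smear(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    static int partition(Object key, int numPartitions) {
        return (smear(key.hashCode()) & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        final int numPartitions = 8; // arbitrary even partition count
        boolean[] hit = new boolean[numPartitions];
        for (int k = 0; k < 2000; k += 2) // the same all-even integer keys
            hit[partition(k, numPartitions)] = true;
        int used = 0;
        for (boolean b : hit) if (b) used++;
        System.out.println("partitions used: " + used + " of " + numPartitions);
        // prints "partitions used: 8 of 8"
    }
}
```

With the smear step in place, the all-even-integer-keys case hits all eight
partitions instead of only the four even-numbered ones.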
> One potential issue with a change like this to the default partitioner is
> backward compatibility, if there is logic somewhere that expects a given
> key to be sent to a given partition.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)