[ https://issues.apache.org/jira/browse/KAFKA-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562077#comment-14562077 ]
Jay Kreps commented on KAFKA-2223:
----------------------------------

This is a good point. There are a number of problems with using Object.hashCode the way the Scala producer does. One is that it is Java-specific, so a producer in another language can't reproduce the same partition function. Another is that many hashCode implementations don't provide a very good distribution. For these reasons, the new producer released in 0.8.2 uses the murmur hash of the binary key data, which distributes better. I think we should probably not change the hash function in the Scala version, though, as that would redistribute existing data across partitions.

> Improve distribution of data when using hash-based partitioning
> ---------------------------------------------------------------
>
>                 Key: KAFKA-2223
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2223
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Gabriel Reid
>         Attachments: KAFKA-2223.patch
>
>
> Both the DefaultPartitioner and the ByteArrayPartitioner compute a partition from the hash code of the key modulo the number of partitions, along the lines of
> {code}partition = key.hashCode() % numPartitions{code}
> (the conversion to an absolute value is omitted here).
> This approach depends entirely on the _lower bits_ of the hash code being uniformly distributed in order to spread records evenly over multiple partitions. If the lower bits of the key's hash code are not uniformly distributed, the key space will not be uniformly distributed over the partitions.
> It can be surprisingly easy to get a very poor distribution. As a simple example, if the keys are integer values that are all divisible by 2, then only half of the partitions will receive data (because the hash code of an integer is the integer value itself).
> This can even be a problem in situations where you would really not expect it.
> For example, taking the 8-byte big-endian byte-array representation of the long timestamp (at millisecond granularity) of each second over a period of 24 hours, and partitioning over 50 partitions, results in 34 of the 50 partitions not getting any data at all.
> The easiest way to resolve this is to have a custom HashPartitioner that applies a supplementary hash function to the return value of the key's hashCode method. The same approach is taken in java.util.HashMap for exactly this reason.
> One potential issue with a change like this to the default partitioner is backward compatibility, if there is logic somewhere that expects a given key to be sent to a given partition.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
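To make the skew concrete, here is a small standalone sketch (not Kafka code; the class and method names are hypothetical) comparing the naive hashCode-modulo scheme with a supplementary hash of the kind java.util.HashMap applies (the JDK 7 spreading function). With integer keys that are all even, the naive scheme can only ever reach half of the partitions:

{code}
import java.util.HashSet;
import java.util.Set;

public class PartitionSkewDemo {
    static final int NUM_PARTITIONS = 32;

    // Naive partitioning along the lines of the Scala DefaultPartitioner:
    // the key's hashCode modulo the partition count. The mask keeps the
    // value non-negative (Math.abs would fail for Integer.MIN_VALUE).
    static int naivePartition(Object key) {
        return (key.hashCode() & 0x7fffffff) % NUM_PARTITIONS;
    }

    // Supplementary hash as in JDK 7's java.util.HashMap: XOR shifted
    // copies of the hash so the high bits influence the low bits.
    static int supplementalHash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    static int spreadPartition(Object key) {
        return (supplementalHash(key.hashCode()) & 0x7fffffff) % NUM_PARTITIONS;
    }

    public static void main(String[] args) {
        Set<Integer> naive = new HashSet<>();
        Set<Integer> spread = new HashSet<>();
        // Even integer keys: Integer.hashCode() is the value itself, so the
        // low bit is always zero and the naive scheme hits at most 16 of
        // the 32 partitions, no matter how many keys are written.
        for (int key = 0; key < 10_000; key += 2) {
            naive.add(naivePartition(key));
            spread.add(spreadPartition(key));
        }
        System.out.println("partitions used without supplementary hash: " + naive.size());
        System.out.println("partitions used with supplementary hash:    " + spread.size());
    }
}
{code}

Running the main method counts the distinct partitions each scheme touches: the naive scheme is stuck on the 16 even-numbered partitions, while the supplementary hash also reaches partitions the naive scheme can never produce for these keys.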