Gabriel Reid created KAFKA-2223:
-----------------------------------

             Summary: Improve distribution of data when using hash-based partitioning
                 Key: KAFKA-2223
                 URL: https://issues.apache.org/jira/browse/KAFKA-2223
             Project: Kafka
          Issue Type: Improvement
            Reporter: Gabriel Reid
Both the DefaultPartitioner and ByteArrayPartitioner compute the partition from the hash code of the key modulo the number of partitions, along the lines of

{code}partition = key.hashCode() % numPartitions{code}

(conversion to an absolute value is omitted here).

This approach depends entirely on the _lower bits_ of the hash code being uniformly distributed in order to get a good distribution of records over multiple partitions. If the lower bits of the key's hash code are not uniformly distributed, the key space will not be uniformly distributed over the partitions.

It can be surprisingly easy to get a very poor distribution. As a simple example, if the keys are integer values that are all divisible by 2, then only half of the partitions will receive data (as the hash code of an integer is the integer value itself).

This can be a problem even in situations where you would really not expect it. For example, taking the 8-byte big-endian byte-array representation of the long timestamp (at millisecond granularity) of each second over a period of 24 hours and partitioning it over 50 partitions results in 34 of the 50 partitions not receiving any data at all.

The easiest way to resolve this is to have a custom HashPartitioner that applies a supplementary hash function to the return value of the key's hashCode method. The same approach is taken in java.util.HashMap, for exactly the same reason.

One potential issue with a change like this to the default partitioner is backward compatibility, if there is existing logic that expects a given key to be sent to a given partition.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
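The even-integer-keys case and the proposed fix can be sketched as follows. This is an illustrative standalone class, not code from the Kafka codebase; the supplemental hash is the one used by java.util.HashMap in JDK 7, and the class and variable names are made up for the example:

```java
import java.util.HashSet;
import java.util.Set;

public class PartitionSpread {

    // Supplemental hash function from JDK 7's java.util.HashMap:
    // XORs higher-order bits down into the lower bits, so that the
    // subsequent modulo sees entropy from the whole hash code.
    static int supplementalHash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        int numPartitions = 8;
        Set<Integer> plain = new HashSet<>();
        Set<Integer> spread = new HashSet<>();

        // Keys are all even integers; Integer's hashCode() is the value
        // itself, so the lowest bit of every hash code is zero.
        for (int k = 0; k < 10000; k += 2) {
            int h = Integer.valueOf(k).hashCode();
            plain.add(Math.abs(h) % numPartitions);
            spread.add(Math.abs(supplementalHash(h)) % numPartitions);
        }

        // Plain modulo only ever lands on the even-numbered partitions.
        System.out.println("partitions used, plain hash:        " + plain.size());  // 4 of 8
        System.out.println("partitions used, supplemental hash: " + spread.size()); // 8 of 8
    }
}
```

With the raw hash, the even keys cover only partitions 0, 2, 4 and 6; after the supplemental hash, all 8 partitions receive data.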