Hello experts, we want to distribute data evenly across the partitions of
a topic in our Kafka cluster.
 Option 1: Send a null partition key, which lets the producer spread data
across partitions.
 Option 2: Choose a key (a random UUID?), which can help distribute the
data roughly 70-80% evenly.

I have seen the side effect below described on the Confluence page about
sending null keys to the producer. Is this still valid for newer versions
of the Kafka producer library?

Why is data not evenly distributed among partitions when a partitioning key
is not specified?

In Kafka producer, a partition key can be specified to indicate the
destination partition of the message. By default, a hashing-based
partitioner is used to determine the partition id given the key, and people
can use customized partitioners also.
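
To make Option 2 concrete, here is a minimal sketch of hash-based
partition selection with random UUID keys. It uses only the Java standard
library and is not Kafka code: the class HashKeySpread, the method names,
and the use of String.hashCode are assumptions for illustration, and the
real producer may use a different hash function.

```java
import java.util.Arrays;
import java.util.UUID;

// Hypothetical illustration class; not part of the Kafka client library.
final class HashKeySpread {

    // Map a key to a partition id. String.hashCode is an assumption made
    // for this sketch; floorMod keeps the result non-negative.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Count how many messages land on each partition when every message
    // gets a fresh random UUID as its key.
    static int[] spread(int numPartitions, int messages) {
        int[] counts = new int[numPartitions];
        for (int i = 0; i < messages; i++) {
            String key = UUID.randomUUID().toString();
            counts[partitionFor(key, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Per-partition message counts for 8 partitions and 100000 messages.
        System.out.println(Arrays.toString(spread(8, 100_000)));
    }
}
```

The same key always maps to the same partition, so the spread depends
entirely on how varied the keys are. A random UUID per message gives up
per-key ordering in exchange for a more even spread.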

To reduce # of open sockets, in 0.8.0 (
https://issues.apache.org/jira/browse/KAFKA-1017), when the partitioning
key is not specified or null, a producer will pick a random partition and
stick to it for some time (default is 10 mins) before switching to another
one. So, if there are fewer producers than partitions, at a given point of
time, some partitions may not receive any data. To alleviate this problem,
one can either reduce the metadata refresh interval or specify a message
key and a customized random partitioner. For more detail see this thread
http://mail-archives.apache.org/mod_mbox/kafka-dev/201310.mbox/%3CCAFbh0Q0aVh%2Bvqxfy7H-%2BMnRFBt6BnyoZk1LWBoMspwSmTqUKMg%40mail.gmail.com%3E
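
To check my understanding of the behaviour described above, here is a
small simulation sketch. It uses only the Java standard library and is
not Kafka code: the class PartitionSpread, its methods, and the
"refreshEvery" message count standing in for the metadata refresh
interval are all assumptions for illustration.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration class; not part of the Kafka client library.
final class PartitionSpread {

    // Sticky behaviour as described for null keys in 0.8.0: pick one random
    // partition and keep it until the simulated refresh, then pick again.
    static int[] stickyCounts(int partitions, int messages,
                              int refreshEvery, Random rnd) {
        int[] counts = new int[partitions];
        int current = rnd.nextInt(partitions);
        for (int i = 0; i < messages; i++) {
            if (i > 0 && i % refreshEvery == 0) {
                current = rnd.nextInt(partitions);
            }
            counts[current]++;
        }
        return counts;
    }

    // A fresh random partition for every message, which is what a
    // customized random partitioner would do.
    static int[] randomCounts(int partitions, int messages, Random rnd) {
        int[] counts = new int[partitions];
        for (int i = 0; i < messages; i++) {
            counts[rnd.nextInt(partitions)]++;
        }
        return counts;
    }

    // Number of partitions that received at least one message.
    static int nonEmpty(int[] counts) {
        int n = 0;
        for (int c : counts) {
            if (c > 0) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        // One producer, 8 partitions, no refresh inside the window.
        int[] sticky = stickyCounts(8, 10_000, 10_000, rnd);
        int[] random = randomCounts(8, 10_000, rnd);
        System.out.println("sticky: " + Arrays.toString(sticky));
        System.out.println("random: " + Arrays.toString(random));
    }
}
```

With a single producer and no refresh inside the window, the sticky
variant sends everything to one partition, while the per-message random
choice reaches all of them. A shorter refresh interval narrows that gap,
which matches the two workarounds the FAQ suggests.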

Please advise on choosing a partition key that does not have such side
effects.

--Senthil
