Todd Palino created KAFKA-1599:
----------------------------------
Summary: Change preferred replica election admin command to handle
large clusters
Key: KAFKA-1599
URL: https://issues.apache.org/jira/browse/KAFKA-1599
Project: Kafka
Issue Type: Improvement
Affects Versions: 0.8.2
Reporter: Todd Palino
We ran into a problem with a cluster that has 70k partitions: we could not
trigger a preferred replica election for all topics and partitions using the
admin tool. On investigation, we found that the JSON object written to the
admin znode to tell the controller to start the election was 1.8 MB in size.
As the default ZooKeeper data size limit is 1 MB, and it is non-trivial to
change, we should come up with a better way to represent the list of topics
and partitions for this admin command.
I have several thoughts on this so far:
1) Trigger the command for all topics and partitions with a JSON object that
does not include an explicit list of them (i.e. a flag that says "all
partitions")
2) Use a more compact JSON representation. Currently, the JSON contains a
"partitions" key which holds a list of dictionaries that each have a "topic"
and a "partition" key, and there must be one list item for each partition.
This results in a lot of needless repetition of key names. Changing this to
a format like this would be much more compact:
{"topics": {"topicName1": [0, 1, 2, 3], "topicName2": [0, 1]}, "version": 1}
3) Use a representation other than JSON. Encoding everything as strings is
inefficient, and a binary format would be the most compact. This does put a
greater burden on tools and scripts that do not use the built-in libraries,
but the burden is not too high.
4) Use a representation that involves multiple znodes. A structured tree in the
admin command would probably provide the most complete solution. However, we
would need to make sure not to exceed the data size limit with a wide tree (the
list of children for any single znode cannot exceed the ZK data size of 1 MB).
Obviously, there could be a combination of #1 with a change in the
representation, which would likely be appropriate as well.
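
To get a rough feel for how much option 2 saves, here is a minimal Python
sketch that serializes the same partition list in both layouts and compares
the sizes. The cluster shape (7,000 topics with 10 partitions each), the topic
names, and the helper names are made up for illustration, and the limit
constant is only an approximation of the default ZooKeeper limit; real numbers
will depend on topic name lengths and partition counts.

```python
import json

# Approximation of the default ZooKeeper data size limit (about 1 MB).
ZK_LIMIT_BYTES = 1024 * 1024


def current_format(assignments):
    """Layout described in the issue: one {"topic", "partition"} item per partition."""
    return {
        "partitions": [
            {"topic": topic, "partition": partition}
            for topic, partitions in assignments.items()
            for partition in partitions
        ]
    }


def compact_format(assignments):
    """Layout proposed in option 2: topic name -> list of partition ids."""
    return {
        "topics": {topic: list(parts) for topic, parts in assignments.items()},
        "version": 1,
    }


def encoded_size(obj):
    """Size in bytes of the JSON encoding with no optional whitespace."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


# Hypothetical cluster: 7,000 topics x 10 partitions = 70,000 partitions.
assignments = {"topic-%04d" % i: list(range(10)) for i in range(7000)}

current_size = encoded_size(current_format(assignments))
compact_size = encoded_size(compact_format(assignments))

print("current layout: %d bytes" % current_size)
print("compact layout: %d bytes" % compact_size)
print("current fits in znode: %s" % (current_size <= ZK_LIMIT_BYTES))
print("compact fits in znode: %s" % (compact_size <= ZK_LIMIT_BYTES))
```

With these made-up inputs the per-partition layout exceeds the limit while the
per-topic layout stays well under it. The compact layout still grows with the
number of partitions, so it would not remove the need for option 1 or option 4
on much larger clusters.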