Todd Palino created KAFKA-1599:
----------------------------------

             Summary: Change preferred replica election admin command to handle large clusters
                 Key: KAFKA-1599
                 URL: https://issues.apache.org/jira/browse/KAFKA-1599
             Project: Kafka
          Issue Type: Improvement
    Affects Versions: 0.8.2
            Reporter: Todd Palino


We ran into a problem with a cluster that has 70k partitions: we could not 
trigger a preferred replica election for all topics and partitions using the 
admin tool. The cause was that the JSON object written to the admin znode to 
tell the controller to start the election was 1.8 MB in size. Because the 
default ZooKeeper data size limit is 1 MB, and it is non-trivial to change, we 
should come up with a better way to represent the list of topics and 
partitions for this admin command.

I have several thoughts on this so far:
1) Trigger the command for all topics and partitions with a JSON object that 
does not include an explicit list of them (i.e. a flag that says "all 
partitions")

2) Use a more compact JSON representation. Currently, the JSON contains a 
"partitions" key which holds a list of dictionaries that each have a "topic" 
and "partition" key, and there must be one list item for each partition. This 
needlessly repeats the key names for every partition. A format like this 
would be much more compact:
{"topics": {"topicName1": [0, 1, 2, 3], "topicName2": [0, 1]}, "version": 1}

3) Use a representation other than JSON. Strings are inefficient, and a binary 
format would be the most compact. This does put a greater burden on tools and 
scripts that do not use the built-in libraries, but the burden is not too high.

4) Use a representation that involves multiple znodes. A structured tree in the 
admin command would probably provide the most complete solution. However, we 
would need to make sure not to exceed the data size limit with a wide tree (the 
list of children for any single znode cannot exceed the ZK data size limit of 
1 MB).
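
To get a rough feel for how much option 2 saves, here is a minimal Python 
sketch that encodes the same partition list in both layouts and compares the 
sizes against the 1 MB limit. The synthetic cluster (7,000 topics with 10 
partitions each), the topic naming scheme, and the helper names are all 
assumptions made for illustration; real topic names and partition counts will 
give different numbers.

```python
import json

# Approximate default ZooKeeper data size limit (the 1 MB mentioned above).
ZK_DEFAULT_LIMIT = 1024 * 1024


def current_format(assignments):
    """Verbose layout: one {"topic", "partition"} object per partition."""
    return {
        "version": 1,
        "partitions": [
            {"topic": topic, "partition": p}
            for topic, parts in assignments.items()
            for p in parts
        ],
    }


def compact_format(assignments):
    """Proposed layout: each topic name maps to a list of partition ids."""
    return {"version": 1, "topics": {t: list(ps) for t, ps in assignments.items()}}


def encoded_size(obj):
    """Size in bytes of the JSON encoding with no optional whitespace."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


# Hypothetical cluster: 7,000 topics x 10 partitions = 70,000 partitions.
assignments = {f"topic-{i:05d}": list(range(10)) for i in range(7000)}

verbose = encoded_size(current_format(assignments))
compact = encoded_size(compact_format(assignments))
print(f"verbose: {verbose} bytes, over limit: {verbose > ZK_DEFAULT_LIMIT}")
print(f"compact: {compact} bytes, over limit: {compact > ZK_DEFAULT_LIMIT}")
```

With these made-up names the verbose layout lands well over the limit while 
the compact one stays under it, but the compact layout still grows with the 
number of partitions, so it postpones the problem rather than removing it.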

Obviously, there could be a combination of #1 with a change in the 
representation, which would likely be appropriate as well.
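
As a sketch of what that combination might look like on the tool side, the 
Python below emits a constant-size "all partitions" request when the caller 
asks for every partition, and falls back to the compact layout otherwise. The 
"all_partitions" field name and the build_election_request helper are 
hypothetical; nothing like them exists in Kafka today, and the controller 
would need a matching change to understand either form.

```python
import json


def build_election_request(requested, all_assignments):
    """Return a JSON payload for a preferred replica election request.

    Both arguments map topic name -> iterable of partition ids.
    "all_partitions" is a hypothetical field name, not an existing Kafka field.
    """
    def normalize(assignments):
        # Sort and de-duplicate partition ids; drop topics with no partitions.
        return {t: sorted(set(ps)) for t, ps in assignments.items() if ps}

    wanted = normalize(requested)
    if wanted == normalize(all_assignments):
        # Option 1: constant-size request regardless of cluster size.
        payload = {"version": 1, "all_partitions": True}
    else:
        # Option 2: compact topic -> partition list layout.
        payload = {"version": 1, "topics": wanted}
    return json.dumps(payload, separators=(",", ":"))


cluster = {"topicName1": [0, 1, 2, 3], "topicName2": [0, 1]}
print(build_election_request(cluster, cluster))
print(build_election_request({"topicName2": [1]}, cluster))
```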



--
This message was sent by Atlassian JIRA
(v6.2#6252)
