[ https://issues.apache.org/jira/browse/KAFKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901279#comment-14901279 ]
Abhishek Nigam commented on KAFKA-1599:
---------------------------------------

Copying this content verbatim from a newly created ticket (KAFKA-2552), a duplicate of this one, which details approach 4).
I think chaining is unavoidable, because even with a more compact representation we might still run into this issue with a larger JSON payload.

"Essentially, a generic approach to this, which would require both the read side and the write side to change, would be as follows:

We designate a ZooKeeper path as scratch.
Ex- /admin/scratch

Write side

When writing JSON to ZooKeeper we will chunk it into 1 MB units and store all but the first chunk in separate ZooKeeper nodes under the scratch path.
The first chunk will live in the original location, as it does today.
Ex- /admin/reassign_partitions

Each chunk will have the following format:
  "JSON-incompatible header" (something other than "{")
  length of the ZooKeeper path to the next JSON chunk (0 means that this is the last chunk)
  ZooKeeper path of the next JSON chunk
  length of this chunk's JSON data blob
  this chunk's JSON data blob

We will write this conceptual linked list back to front.

Read side

The ZooKeeper watch will fire as before. While reading, if we detect that there are more chunks, we will do synchronous reads from ZooKeeper."

> Change preferred replica election admin command to handle large clusters
> ------------------------------------------------------------------------
>
>                 Key: KAFKA-1599
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1599
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 0.8.2.0
>            Reporter: Todd Palino
>            Assignee: Abhishek Nigam
>              Labels: newbie++
>
> We ran into a problem with a cluster that has 70k partitions where we could
> not trigger a preferred replica election for all topics and partitions using
> the admin tool.
> Upon investigation, it was determined that this was because
> the JSON object that was being written to the admin znode to tell the
> controller to start the election was 1.8 MB in size. As the default Zookeeper
> data size limit is 1MB, and it is non-trivial to change, we should come up
> with a better way to represent the list of topics and partitions for this
> admin command.
> I have several thoughts on this so far:
> 1) Trigger the command for all topics and partitions with a JSON object that
> does not include an explicit list of them (i.e. a flag that says "all
> partitions")
> 2) Use a more compact JSON representation. Currently, the JSON contains a
> 'partitions' key which holds a list of dictionaries that each have a 'topic'
> and 'partition' key, and there must be one list item for each partition. This
> results in a lot of repetition of key names that is unneeded. Changing this
> to a format like this would be much more compact:
> {'topics': {'topicName1': [0, 1, 2, 3], 'topicName2': [0,1]}, 'version': 1}
> 3) Use a representation other than JSON. Strings are inefficient. A binary
> format would be the most compact. This does put a greater burden on tools and
> scripts that do not use the inbuilt libraries, but it is not too high.
> 4) Use a representation that involves multiple znodes. A structured tree in
> the admin command would probably provide the most complete solution. However,
> we would need to make sure to not exceed the data size limit with a wide tree
> (the list of children for any single znode cannot exceed the ZK data size of
> 1MB)
> Obviously, there could be a combination of #1 with a change in the
> representation, which would likely be appropriate as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
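
For illustration, the chained-chunk layout from the comment (header, next-path length, next path, data length, data, written back to front) might look roughly like the Python sketch below. This is not code from Kafka or from KAFKA-2552. The `#` header byte, the 4-byte big-endian length fields, the `chunk-N` node names, and the in-memory dict standing in for ZooKeeper are all assumptions made for the example. Note also that the header adds a few bytes per chunk, so a real chunk size would probably need to sit somewhat below the znode limit.

```python
import json
import struct

MAGIC = b"#"                # hypothetical header byte: anything other than "{"
SCRATCH = "/admin/scratch"  # scratch path named in the proposal


def encode_chunk(next_path: str, data: bytes) -> bytes:
    """Pack one chunk: header, next-path length, next path, data length, data."""
    path = next_path.encode("utf-8")
    return (MAGIC + struct.pack(">I", len(path)) + path
            + struct.pack(">I", len(data)) + data)


def decode_chunk(raw: bytes):
    """Unpack one chunk into (next_path, data); next_path is "" for the last chunk."""
    if raw[:1] != MAGIC:
        raise ValueError("not a chunked payload")
    (path_len,) = struct.unpack_from(">I", raw, 1)
    pos = 5
    next_path = raw[pos:pos + path_len].decode("utf-8")
    pos += path_len
    (data_len,) = struct.unpack_from(">I", raw, pos)
    pos += 4
    return next_path, raw[pos:pos + data_len]


def write_chunked(store: dict, head_path: str, payload: bytes, chunk_size: int) -> None:
    """Write the chain back to front, so the head node is written last."""
    pieces = [payload[i:i + chunk_size]
              for i in range(0, len(payload), chunk_size)] or [b""]
    next_path = ""  # a zero-length path marks the last chunk
    for index in range(len(pieces) - 1, -1, -1):
        # Hypothetical naming: chunk 0 stays at the original path,
        # later chunks live under the scratch path.
        path = head_path if index == 0 else f"{SCRATCH}/chunk-{index}"
        store[path] = encode_chunk(next_path, pieces[index])
        next_path = path


def read_chunked(store: dict, head_path: str) -> bytes:
    """Follow the next-path pointers from the head node and reassemble the payload."""
    parts = []
    path = head_path
    while path:
        path, data = decode_chunk(store[path])
        parts.append(data)
    return b"".join(parts)


# Usage with a dict standing in for ZooKeeper and a tiny chunk size.
store = {}
payload = json.dumps(
    {"partitions": [{"topic": "t", "partition": p} for p in range(50)]}
).encode("utf-8")
write_chunked(store, "/admin/reassign_partitions", payload, chunk_size=256)
assert read_chunked(store, "/admin/reassign_partitions") == payload
```

Writing back to front means the node at the original path, which is the one the watch fires on, only appears once every chunk it points to already exists.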
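
To get a feel for option 2) in the issue description, the following Python sketch converts the list-of-dictionaries layout into the proposed topic-to-partition-list layout and compares the serialized sizes. The cluster shape (100 topics with 16 partitions each), the topic names, and the `version` key in the current layout are assumptions made for the example. The actual savings depend on topic-name length and partition counts.

```python
import json
from collections import defaultdict


def to_compact(current: dict) -> dict:
    """Convert the list-of-dicts layout into the topic -> partition-list layout."""
    topics = defaultdict(list)
    for entry in current["partitions"]:
        topics[entry["topic"]].append(entry["partition"])
    return {
        "topics": {name: sorted(parts) for name, parts in topics.items()},
        "version": 1,
    }


# Hypothetical cluster: 100 topics with 16 partitions each.
current = {
    "version": 1,
    "partitions": [
        {"topic": f"topicName{t}", "partition": p}
        for t in range(100)
        for p in range(16)
    ],
}
compact = to_compact(current)

# Serialize without extra whitespace so only the structure is compared.
current_size = len(json.dumps(current, separators=(",", ":")))
compact_size = len(json.dumps(compact, separators=(",", ":")))
print(f"current: {current_size} bytes, compact: {compact_size} bytes")
```

The compact layout writes each topic name once instead of once per partition, which is where most of the reduction comes from. It shrinks the payload but does not remove the size limit, which is the point made in the comment above about chaining.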