What I would like to see is a way for topics to be removed automatically after they have been inactive for a period of time. That might help in this case.
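Nothing in Kafka does this automatically, but it could be scripted externally. Here is a minimal sketch of such a cleanup job against the modern Java AdminClient (which postdates this thread), assuming delete.topic.enable=true on the brokers, a client recent enough to have OffsetSpec.maxTimestamp() and DescribeTopicsResult.allTopicNames(), and a hypothetical seven-day idle threshold:

import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartition;

public class IdleTopicReaper {
    public static void main(String[] args) throws Exception {
        long maxIdleMs = Duration.ofDays(7).toMillis();  // hypothetical policy
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            for (String topic : admin.listTopics().names().get()) {
                TopicDescription desc = admin.describeTopics(List.of(topic))
                        .allTopicNames().get().get(topic);
                // Ask each partition for the timestamp of its newest record.
                Map<TopicPartition, OffsetSpec> query = new HashMap<>();
                desc.partitions().forEach(p -> query.put(
                        new TopicPartition(topic, p.partition()),
                        OffsetSpec.maxTimestamp()));
                long newest = admin.listOffsets(query).all().get().values().stream()
                        .mapToLong(ListOffsetsResult.ListOffsetsResultInfo::timestamp)
                        .max().orElse(-1L);
                // Skip topics that have never seen a record (timestamp -1);
                // whether to delete those too is a separate policy decision.
                if (newest > 0 &&
                        Instant.now().toEpochMilli() - newest > maxIdleMs) {
                    admin.deleteTopics(List.of(topic)).all().get();
                }
            }
        }
    }
}

A real job would also want to exclude internal/system topics before deleting anything.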
I added a comment to this larger jira:
https://issues.apache.org/jira/browse/KAFKA-330

Perhaps it should really be its own jira entry.

Jason

On Tue, Oct 8, 2013 at 10:29 AM, Aniket Bhatnagar <
aniket.bhatna...@gmail.com> wrote:

> Thanks Neha. Is it worthwhile to investigate an option to store topic
> metadata (partitions, etc.) in another consistent data store (MySQL,
> HBase, etc.)? Should we make this feature pluggable?
>
> The reason I think we may need to surpass the 2000 total partition limit
> is that there may be genuine use cases for a high number of topics. For
> example, in my particular case, I am using Kafka as a buffer to store
> data arriving from various sensors deployed in the physical world. These
> sensors may be short lived or long lived. I was thinking of having an
> individual topic for each sensor. This way, if a badly behaving sensor
> attempts to push data at a much faster rate than we can process as a
> Kafka consumer, we will eventually overflow and start losing data for
> that particular sensor only, while still potentially continuing to
> process data from the other sensors that are pushing data at a manageable
> rate. If I go with one topic for all the sensors, one misbehaving sensor
> can potentially prevent us from catching up with the topic within the
> retention period, making us lose data from all sensors.
>
> The other issue is that if we go with a topic per sensor, the sensors are
> short lived, and we have already reached the threshold with 2000 deployed
> sensors, Kafka will stop working (because of the ZooKeeper limitation)
> even though the previously deployed sensors may no longer be active at
> all.
>
> I am sure there are other genuine use cases for having far more than
> 2000 topics.
>
>
> On 4 October 2013 19:04, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>
> > You probably want to think of this in terms of the number of partitions
> > on a single broker, instead of per topic, since I/O is the limiting
> > factor in this case. Another factor to consider is the total number of
> > partitions in the cluster, as ZooKeeper becomes a limiting factor
> > there. 30 partitions is not too large, provided the total number of
> > partitions doesn't exceed roughly a couple thousand. To give you an
> > example, some of our clusters are 16 nodes big, and some of the topics
> > on those clusters have 30 partitions.
> >
> > Thanks,
> > Neha
> >
> > On Oct 4, 2013 4:15 AM, "Aniket Bhatnagar" <aniket.bhatna...@gmail.com>
> > wrote:
> >
> > > I am using Kafka as a buffer for data streaming in from various
> > > sources. Since it is time-series data, I generate the key for each
> > > message by combining the source ID and the minute in the timestamp.
> > > This means I can have at most 60 partitions per topic (as each source
> > > has its own topic). I have set num.partitions to 30 (60/2) for each
> > > topic in the broker config. I don't have a very good reason for
> > > picking 30 as the default number of partitions per topic, but I
> > > wanted it to be a high number so that I can achieve high parallelism
> > > during in-stream processing. I am worried that a high number like 30
> > > (the default configuration had it as 2) could negatively impact
> > > Kafka's performance in terms of message throughput or memory
> > > consumption. I understand that this can lead to many files per
> > > partition, but I am thinking of dealing with that by having multiple
> > > directories on the same disk if I run into issues at all.
> > >
> > > My question to the community: am I prematurely optimizing the
> > > partition count, given that right now even 5 partitions seem
> > > sufficient, and will I therefore run into unwanted issues? Or is 30
> > > an OK number to use for the number of partitions?
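As a rough illustration of the keying scheme described above, here is a minimal sketch using the current Java producer (the thread itself predates this API; the broker address, topic name, and payload are hypothetical). The key combines the source ID with the minute-of-hour, so each topic sees at most 60 distinct keys:

import java.time.Instant;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SensorProducer {

    // Key = source ID plus minute-of-hour: at most 60 distinct keys per topic.
    static String key(String sourceId, long timestampMs) {
        long minuteOfHour = (timestampMs / 60_000L) % 60;  // 0..59
        return sourceId + "-" + minuteOfHour;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // hypothetical broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            long now = Instant.now().toEpochMilli();
            // One topic per sensor, as in the thread; the key spreads one
            // sensor's data over at most 60 minute-of-hour buckets.
            producer.send(new ProducerRecord<>(
                    "sensor-42", key("sensor-42", now), "{\"reading\": 21.5}"));
        }
    }
}

Note that 60 distinct keys only bound the usable parallelism at 60: the default hash partitioner gives no guarantee that 60 keys land on 60 distinct partitions, so with 30 partitions some will receive more minute buckets than others.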