Thanks, Todd. That's the current thinking. We use multiple clusters in a
single data center for Solr to avoid a similar problem (the number of
collections per cluster, in Solr's case). Your numbers are encouraging. I
will go ahead with this design for now. Thanks!

-Neelesh

On Mar 21, 2014 6:42 AM, "Todd Palino" <tpal...@linkedin.com> wrote:
> Filehandles can definitely be a concern here, but you can mitigate it to
> some extent by adding more brokers to the cluster. The number of open file
> handles is going to be driven in large part by the number of log files on
> disk. This, in turn, is governed by the number of partitions and how many
> files you have for each partition. That, in turn, is governed by the
> amount of data you see for each partition and your retention settings.
> It's a lot of tweaking :)
>
> So, for one example, I have a cluster with over 13k partitions on it. Up
> until recently, it was running on 5 brokers (we just added 4 more). One of
> those brokers has about 6300 log files on disk right now, and it's running
> with over 7000 open filehandles. Given that the message traffic stays the
> same, if I wanted to reduce the number of log files, I could increase the
> size of each log file. I could also decrease the retention time for the
> data. Another option is to increase the number of brokers, to spread out
> the load more evenly. So you have options to keep your filehandles
> manageable.
>
> The other side of this is whether or not the controller can efficiently
> handle that many topics and partitions in a single cluster, but that's not
> a filehandles problem. So far, I've not seen controller performance issues
> with any of our clusters, but that could potentially change if you go up
> an order of magnitude on the partition count in a single cluster. Is there
> the possibility of splitting your customers out to multiple clusters if
> that was identified as a problem?
>
> -Todd
>
> On 3/20/14 9:30 PM, "Neelesh" <neele...@gmail.com> wrote:
>
> >Hi,
> >  We are prototyping Kafka + Storm for our stream processing / event
> >processing needs. One of the issues we face is a huge influx of stream
> >data from one of our customers.
> >If we have a single topic for this stream for all customers, other
> >customers who are behind the big customer's stream would starve for a
> >significant time, until their turn comes.
> >  One idea is to create a topic per customer per use case, implement a
> >fairness algorithm on top of the high-level consumer using
> >*createMessageStreamsByFilter*, and use that to build a Storm spout.
> >However, this also means tens of thousands of topics and several tens of
> >thousands (even hundreds of thousands) of partitions on a single Kafka
> >cluster. I remember reading that you are effectively limited by
> >filehandles. Has anyone tried such a setup?
> >
> >Thanks!
> >-Neelesh
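Todd's filehandle arithmetic above can be put as a back-of-the-envelope estimate: open log files on one broker is roughly partitions hosted × segment files per partition, where the segment count follows from retained bytes and segment size. A minimal sketch; the function name and all the numbers are illustrative assumptions, not measurements from his cluster:

```python
import math

def estimate_open_files(partitions_on_broker, retained_bytes_per_partition,
                        segment_bytes, files_per_segment=2):
    """Rough count of open log files on one broker.

    Kafka keeps every log segment open; each partition holds roughly
    ceil(retained bytes / segment size) segments, and files_per_segment
    accounts for the index file that accompanies each .log file.
    """
    segments = math.ceil(retained_bytes_per_partition / segment_bytes)
    return partitions_on_broker * segments * files_per_segment

# Illustrative inputs: ~2600 partitions per broker (13k spread over 5
# brokers), 1 GB retained per partition, 512 MB segments.
files = estimate_open_files(2600, 1 * 2**30, 512 * 2**20)  # → 10400
```

The estimate makes Todd's three levers visible: raising `segment_bytes` or shortening retention shrinks `segments`, and adding brokers shrinks `partitions_on_broker`.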
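The fairness idea in the original question (one stream per customer, drained round-robin so one heavy customer cannot starve the rest) can be sketched independently of the Kafka consumer API. This is an illustrative scheduler over in-memory queues, not the actual `createMessageStreamsByFilter` plumbing; all names here are hypothetical:

```python
from collections import deque

def round_robin_drain(streams, batch_size=1):
    """Interleave messages from per-customer streams.

    streams: dict mapping customer id -> deque of pending messages.
    Takes up to batch_size messages from each stream per pass, so a
    customer with a huge backlog cannot starve the others.
    """
    order = deque(streams)  # fixed rotation over customer ids
    while any(streams.values()):
        customer = order[0]
        order.rotate(-1)
        q = streams[customer]
        for _ in range(min(batch_size, len(q))):
            yield customer, q.popleft()

streams = {
    "big-customer": deque(f"b{i}" for i in range(5)),
    "small-customer": deque(["s0"]),
}
interleaved = list(round_robin_drain(streams))
# small-customer's single message is emitted on the first pass, not
# after all five big-customer messages.
```

In a real spout the deques would be fed by the per-topic consumer streams; the scheduling idea is the same.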