Filehandles can definitely be a concern here, but you can mitigate the problem to some extent by adding more brokers to the cluster. The number of open filehandles is driven in large part by the number of log files on disk. That, in turn, is governed by the number of partitions and how many log files each partition keeps, which is in turn governed by how much data each partition sees and your retention settings. It's a lot of tweaking :)
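To make that arithmetic concrete, here's a rough back-of-envelope sketch. The helper function and every number in it are illustrative assumptions for the sake of the example, not measurements or a Kafka API:

```python
# Back-of-envelope estimate of log files (and hence filehandles) on one
# broker. Roughly: each partition keeps ceil(retained bytes / segment size)
# log segments on disk, and each segment costs at least one open handle
# (more in practice, counting index files and in-flight reads).

def estimated_log_files(partitions_per_broker: int,
                        bytes_retained_per_partition: int,
                        segment_bytes: int) -> int:
    # -(-a // b) is integer ceiling division
    segments_per_partition = -(-bytes_retained_per_partition // segment_bytes)
    return partitions_per_broker * segments_per_partition

# Hypothetical broker: ~2600 partitions, ~2.5 GB retained per partition,
# 1 GB log segments:
baseline = estimated_log_files(2600, 2_500_000_000, 1_000_000_000)

# Doubling the segment size shrinks the per-partition segment count,
# which is the "increase the size of each log file" lever:
bigger_segments = estimated_log_files(2600, 2_500_000_000, 2_000_000_000)
```

Shrinking retention (the second argument) or spreading partitions over more brokers (the first argument) moves the estimate the same way.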
So, for one example, I have a cluster with over 13k partitions on it. Until recently it was running on 5 brokers (we just added 4 more). One of those brokers has about 6,300 log files on disk right now, and it's running with over 7,000 open filehandles.

Assuming the message traffic stays the same, if I wanted to reduce the number of log files, I could increase the size of each log file. I could also decrease the retention time for the data. Another option is to add more brokers to spread the load more evenly. So you have options to keep your filehandles manageable.

The other side of this is whether the controller can efficiently handle that many topics and partitions in a single cluster, but that's not a filehandles problem. So far I've not seen controller performance issues with any of our clusters, but that could change if you go up an order of magnitude on the partition count in a single cluster. Is there a possibility of splitting your customers across multiple clusters if that were identified as a problem?

-Todd

On 3/20/14 9:30 PM, "Neelesh" <neele...@gmail.com> wrote:

>Hi,
>  We are prototyping kafka + storm for our stream processing / event
>processing needs. One of the issues we face is a huge influx of stream
>data from one of our customers. If we have a single topic for this stream
>for all customers, other customers who are behind the big customer stream
>would starve for a significant time, until their turn comes.
>  One idea is to create a topic per customer per use case, implement a
>fairness algorithm on top of the high level consumer using
>*createMessageStreamsByFilter*, and use that to build a storm Spout.
>However, this also means tens of thousands of topics and several tens of
>thousands (even hundreds of thousands) of partitions on a single kafka
>cluster.
>  I remember reading that you are effectively limited by filehandles.
>Has anyone tried such a setup?
>
>Thanks!
>-Neelesh