Filehandles can definitely be a concern here, but you can mitigate it to
some extent by adding more brokers to the cluster. The number of open file
handles is going to be driven in large part by the number of log files on
disk. This, in turn, is governed by the number of partitions and how many
files you have for each partition. That, in turn, is governed by the
amount of data you see for each partition and your retention settings.
It's a lot of tweaking :)
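As a sketch of where that tweaking happens, these are the broker settings involved (the values below are illustrative only, not recommendations):

```properties
# server.properties -- illustrative values only
log.segment.bytes=1073741824    # 1 GiB per segment; bigger segments mean fewer files on disk
log.retention.hours=168         # keep 7 days of data; shorter retention means fewer segments
num.partitions=8                # default partition count for newly created topics
```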

So, for one example, I have a cluster with over 13k partitions on it. Up
until recently, it was running on 5 brokers (we just added 4 more). One of
those brokers has about 6300 log files on disk right now, and it's running
with over 7000 open filehandles. Assuming the message traffic stays the
same, if I wanted to reduce the number of log files, I could increase the
size of each log segment. I could also decrease the retention time for the
data. Another option is to increase the number of brokers, to spread out
the load more evenly. So you have options to keep your filehandles
manageable.
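To put rough numbers on that, here's a back-of-envelope sketch in plain Python. The per-segment file count is an assumption (each segment is a .log file plus a .index file, both of which the broker holds open), not something measured on this cluster:

```python
# Back-of-envelope filehandle estimate. Assumption: each log segment on
# disk is one .log file plus one .index file, both held open by the broker.

def estimate_open_files(partitions_per_broker, segments_per_partition,
                        files_per_segment=2):
    """Rough count of open log-related filehandles on a single broker."""
    return partitions_per_broker * segments_per_partition * files_per_segment

# Roughly the cluster above: ~13k partitions spread over 5 brokers.
partitions_per_broker = 13000 // 5          # 2600 partitions per broker
print(estimate_open_files(partitions_per_broker, 1))   # 5200
```

That lands in the same ballpark as the ~6300 files / ~7000 handles above once you account for partitions with more than one segment and the broker's other open descriptors (sockets, etc.).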

The other side of this is whether the controller can efficiently handle
that many topics and partitions in a single cluster, but that's not a
filehandles problem. So far, I've not seen controller performance issues
with any of our clusters, but that could change if you go up an order of
magnitude on the partition count in a single cluster. Is there the
possibility of splitting your customers out across multiple clusters if
that were identified as a problem?

-Todd

On 3/20/14 9:30 PM, "Neelesh" <neele...@gmail.com> wrote:

>Hi,
>   We are prototyping Kafka + Storm for our stream processing / event
>processing needs. One of the issues we face is a huge influx of stream
>data
>from one of our customers. If we have a single topic for this stream for
>all customers, other customers who are behind the big customer stream
>would
>starve for significant time, until their turn comes.
>    One idea is to create a topic per customer per use case, implement a
>fairness algorithm on top of the high level consumer using
>*createMessageStreamsByFilter*, and use that to build a Storm spout.
>However, this also means tens of thousands of topics and several tens of
>thousands (even hundreds of thousands) of partitions on a single kafka
>cluster.
>     I remember reading that you are effectively limited by filehandles.
>Has anyone tried such a setup?
>
>Thanks!
>-Neelesh
