Getting the partitioning right up front is only important if your messages are keyed. If they're not, stop reading, start with a fairly low number of partitions, and expand as needed.
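To make the keyed caveat concrete: the Java producer's default partitioner hashes the key to pick a partition, so changing the partition count later remaps existing keys and breaks per-key ordering. A minimal sketch of that mapping, using the client library's own hash helpers rather than the real partitioner class:

    import org.apache.kafka.common.utils.Utils;

    // Mirrors what the default Java partitioner does for keyed records:
    // murmur2 hash of the serialized key, modulo the partition count.
    // Records with no key are spread round-robin instead, which is why
    // unkeyed topics can grow their partition count freely.
    public final class KeyedPartitioning {
        static int partitionForKey(byte[] keyBytes, int numPartitions) {
            // toPositive masks the sign bit so the result is non-negative
            return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        }

        public static void main(String[] args) {
            byte[] key = "user-42".getBytes();
            System.out.println(partitionForKey(key, 8));   // some fixed partition in [0, 8)
            System.out.println(partitionForKey(key, 16));  // very likely a different one
        }
    }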
1000 partitions per topic is generally not normal. It's not really a problem, but you want to size topics appropriately, because every partition represents open file handles and overhead on the cluster controller.

If you're working with keyed messages, though, size for your eventual data volume. We use a general guideline of keeping each partition under 25 GB on disk (with 4 days of retention, that's roughly 6 GB of compressed messages per partition per day). We find this gives us a good spread of data across the cluster and a reasonable amount of network throughput per partition, so it allows us to scale easily. It also makes for fewer issues with replication within the cluster and with mirroring to other clusters.

Outside of a guideline like that, partition based on how you want to spread out your keys. We have a user who wanted 720 partitions for a given topic because 720 has a large number of factors, which allows them to run a variety of consumer counts and still have balanced load.

As far as multiple disks go, yes, Kafka can make use of multiple log dirs (the broker's log.dirs setting takes a comma-separated list of directories). However, there are caveats. Kafka is fairly naive about how it assigns partitions to disks: the controller assigns partitions to a broker with no knowledge of the disks underneath, and the broker then pins each partition to a single disk. In addition, there's no tool for moving partitions from one mount point to another without shutting down the broker and doing it manually.

(A few quick sketches of the sizing math, the 720-factor trick, and a multi-disk config are appended below the quoted mail.)

-Todd

On Tue, Dec 1, 2015 at 4:31 AM, Guillermo Ortiz <konstt2...@gmail.com> wrote:

> Hello,
>
> I want to size a Kafka cluster with just one topic, and I'm going to
> process the data with Spark and other applications.
>
> If I have six hard drives per node, is Kafka smart enough to deal with
> them? I guess memory is very important here, with all data cached in
> memory. Is it possible to configure Kafka to use many directories, each
> one on a different disk, the way HDFS does?
>
> I'm not sure about the number of partitions either. I have read
> http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
> and they talk about a number of partitions much higher than I had
> thought. Is it normal to have a topic with 1000 partitions? I was
> thinking about two to four partitions per node. Is that wrong?
>
> As I'm going to process the data with Spark, I could have numberPartitions
> equal to numberExecutors in Spark at most, always thinking about the
> future and sizing higher than that.

--
Todd Palino
Staff Site Reliability Engineer
Data Infrastructure Streaming
linkedin.com/in/toddpalino
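To make the 25 GB guideline above concrete, here's a back-of-the-envelope partition count calculation. The daily volume is a made-up input, not a measurement from any real cluster:

    // Back-of-the-envelope sizing against the ~25 GB-per-partition guideline.
    // The daily volume here is a hypothetical example input.
    public final class PartitionSizing {
        public static void main(String[] args) {
            double dailyVolumeGb = 900.0;    // compressed GB produced per day (assumed)
            int retentionDays = 4;           // topic retention, as in the guideline
            double maxPartitionGb = 25.0;    // keep each partition under this on disk

            double retainedGb = dailyVolumeGb * retentionDays;  // 3600 GB retained
            int partitions = (int) Math.ceil(retainedGb / maxPartitionGb);

            // Prints: retained 3600 GB -> at least 144 partitions
            // (replication multiplies total disk usage, not the partition count)
            System.out.printf("retained %.0f GB -> at least %d partitions%n",
                    retainedGb, partitions);
        }
    }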
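The appeal of 720 in the example above is that it divides evenly by many consumer-group sizes, so every consumer in the group owns the same number of partitions. A quick check:

    import java.util.ArrayList;
    import java.util.List;

    // A highly composite partition count like 720 means any consumer group
    // whose size divides it gets a perfectly balanced assignment.
    public final class BalancedGroupSizes {
        public static void main(String[] args) {
            int partitions = 720;
            List<Integer> groupSizes = new ArrayList<>();
            for (int consumers = 1; consumers <= partitions; consumers++) {
                if (partitions % consumers == 0) {
                    // each consumer would own exactly partitions / consumers partitions
                    groupSizes.add(consumers);
                }
            }
            // 720 = 2^4 * 3^2 * 5, so there are 30 such sizes:
            // 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24, 30, 36, ...
            System.out.println(groupSizes.size() + " balanced group sizes: " + groupSizes);
        }
    }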
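And a sketch of what the multi-disk setup looks like in the broker config; the mount points are made up, but log.dirs is the actual setting. Note that the broker simply places each new partition in whichever directory currently holds the fewest partitions, with no regard for free space, which is part of the naivety mentioned above:

    # server.properties excerpt -- one log dir per physical disk (paths are examples)
    log.dirs=/data1/kafka-logs,/data2/kafka-logs,/data3/kafka-logs,/data4/kafka-logs,/data5/kafka-logs,/data6/kafka-logs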