Getting the partitioning right up front is only important if your messages are keyed. If they're not, stop reading, start with a fairly low number of partitions, and expand as needed.
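To make the keyed caveat concrete: the Java producer's default partitioner hashes the key to pick a partition, so changing the partition count later remaps existing keys and breaks per-key ordering. A minimal sketch of that mapping, using the client library's own hash helpers rather than the real partitioner class:

    import org.apache.kafka.common.utils.Utils;

    // Mirrors what the default Java partitioner does for keyed records:
    // murmur2 hash of the serialized key, modulo the partition count.
    // Records with no key are spread round-robin instead, which is why
    // unkeyed topics can grow their partition count freely.
    public final class KeyedPartitioning {
        static int partitionForKey(byte[] keyBytes, int numPartitions) {
            // toPositive masks the sign bit so the result is non-negative
            return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        }

        public static void main(String[] args) {
            byte[] key = "user-42".getBytes();
            System.out.println(partitionForKey(key, 8));   // some fixed partition in [0, 8)
            System.out.println(partitionForKey(key, 16));  // very likely a different one
        }
    }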
1000 partitions per topic is generally not normal. It's not really a problem, but you want to size topics appropriately, because every partition represents open file handles and overhead on the cluster controller.

If you're working with keyed messages, though, size for your eventual data volume. We use a general guideline of keeping each partition under 25 GB on disk (with 4 days of retention, that's roughly 6 GB of compressed messages per partition per day). We find this gives us a good spread of data across the cluster and a reasonable amount of network throughput per partition, so it allows us to scale easily. It also makes for fewer issues with replication within the cluster and with mirroring to other clusters.

Outside of a guideline like that, partition based on how you want to spread out your keys. We have a user who wanted 720 partitions for a given topic because 720 has a large number of factors, which allows them to run a variety of consumer counts and still have balanced load.

As far as multiple disks go, yes, Kafka can make use of multiple log dirs (the broker's log.dirs setting takes a comma-separated list of directories). However, there are caveats. Kafka is fairly naive about how it assigns partitions to disks: the controller assigns partitions to a broker with no knowledge of the disks underneath, and the broker then pins each partition to a single disk. In addition, there's no tool for moving partitions from one mount point to another without shutting down the broker and doing it manually.

(A few quick sketches of the sizing math, the 720-factor trick, and a multi-disk config are appended below the quoted mail.)

-Todd

On Tue, Dec 1, 2015 at 4:31 AM, Guillermo Ortiz <konstt2...@gmail.com> wrote:

> Hello,
>
> I want to size a Kafka cluster with just one topic, and I'm going to
> process the data with Spark and other applications.
>
> If I have six hard drives per node, is Kafka smart enough to deal with
> them? I guess memory is very important here, with all data cached in
> memory. Is it possible to configure Kafka to use many directories, each
> one on a different disk, the way HDFS does?
>
> I'm not sure about the number of partitions either. I have read
> http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
> and they talk about a number of partitions much higher than I had
> thought. Is it normal to have a topic with 1000 partitions? I was
> thinking about two to four partitions per node. Is that wrong?
>
> As I'm going to process the data with Spark, I could have numberPartitions
> equal to numberExecutors in Spark at most, always thinking about the
> future and sizing higher than that.

--
Todd Palino
Staff Site Reliability Engineer
Data Infrastructure Streaming
linkedin.com/in/toddpalino
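To make the 25 GB guideline above concrete, here's a back-of-the-envelope partition count calculation. The daily volume is a made-up input, not a measurement from any real cluster:

    // Back-of-the-envelope sizing against the ~25 GB-per-partition guideline.
    // The daily volume here is a hypothetical example input.
    public final class PartitionSizing {
        public static void main(String[] args) {
            double dailyVolumeGb = 900.0;    // compressed GB produced per day (assumed)
            int retentionDays = 4;           // topic retention, as in the guideline
            double maxPartitionGb = 25.0;    // keep each partition under this on disk

            double retainedGb = dailyVolumeGb * retentionDays;  // 3600 GB retained
            int partitions = (int) Math.ceil(retainedGb / maxPartitionGb);

            // Prints: retained 3600 GB -> at least 144 partitions
            // (replication multiplies total disk usage, not the partition count)
            System.out.printf("retained %.0f GB -> at least %d partitions%n",
                    retainedGb, partitions);
        }
    }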
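The appeal of 720 in the example above is that it divides evenly by many consumer-group sizes, so every consumer in the group owns the same number of partitions. A quick check:

    import java.util.ArrayList;
    import java.util.List;

    // A highly composite partition count like 720 means any consumer group
    // whose size divides it gets a perfectly balanced assignment.
    public final class BalancedGroupSizes {
        public static void main(String[] args) {
            int partitions = 720;
            List<Integer> groupSizes = new ArrayList<>();
            for (int consumers = 1; consumers <= partitions; consumers++) {
                if (partitions % consumers == 0) {
                    // each consumer would own exactly partitions / consumers partitions
                    groupSizes.add(consumers);
                }
            }
            // 720 = 2^4 * 3^2 * 5, so there are 30 such sizes:
            // 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24, 30, 36, ...
            System.out.println(groupSizes.size() + " balanced group sizes: " + groupSizes);
        }
    }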
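And a sketch of what the multi-disk setup looks like in the broker config; the mount points are made up, but log.dirs is the actual setting. Note that the broker simply places each new partition in whichever directory currently holds the fewest partitions, with no regard for free space, which is part of the naivety mentioned above:

    # server.properties excerpt -- one log dir per physical disk (paths are examples)
    log.dirs=/data1/kafka-logs,/data2/kafka-logs,/data3/kafka-logs,/data4/kafka-logs,/data5/kafka-logs,/data6/kafka-logs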