Jun's post is a good start, but I find it's easier to talk in terms
of more
concrete reasons and guidance for having fewer or more partitions per
topic.
Start with the number of brokers in the cluster. This is a good baseline
for the minimum number of partitions in a topic, as it will assure
balance
over the cluster. Of course, if you have lots of topics, you can
potentially skip past this as you'll end up with balanced load in the
aggregate, but I think it's a good practice regardless. As with all
other
advice here, there are always exceptions. If you really, really, really
need to assure ordering of messages, you might be stuck with a single
partition for some use cases.
In general, you should pick more partitions if a) the topic is very
busy,
or b) you have more consumers. Looking at the second case first, you
always
want to have at least as many partitions in a topic as you have
individual
consumers in a consumer group. So if you have 16 consumers in a single
group, you will want the topic they consume to have at least 16
partitions.
In fact, you may also want to always have a multiple of the number of
consumers so that you have even distribution. How many consumers you
have
in a group is going to be driven more by what you do with the
messages once
they are consumed, so here you'll be looking from the bottom of your
stack
up, until you get to Kafka.
How busy the topic is is looking from the top down, through the
producer,
to Kafka. It's also a little more difficult to provide guidance on.
We have
a policy of expanding partitions for a topic whenever the size of the
partition on disk (full retention over 4 days) is larger than 50 GB. We
find that this gives us a few benefits. One is that it takes a
reasonable
amount of time when we need to move a partition from one broker to
another.
Another is that when we have partitions that are larger than this,
the rate
tends to cause problems with consumers. For example, we see mirror maker
perform much better, and have less spiky lag problems, when we stay
under
this limit. We're even considering revising the limit down a little, as
we've had some reports from other wildcard consumers that they've had
problems keeping up with topics that have partitions larger than
about 30
GB.
The last thing to look at is whether or not you are producing keyed
messages to the topic. If you're working with unkeyed messages, there
is no
problem. You can usually add partitions whenever you want to down the
road
with little coordination with producers and consumers. If you are
producing
keyed messages, there is a good chance you do not want to change the
distribution of keys to partitions at various points in the future
when you
need to size up. This means that when you first create the topic, you
probably want to create it with enough partitions to deal with growth
over
time, both on the produce and consume side, even if that is too many
partitions right now by other measures. For example, we have one
client who
requested 720 partitions for a particular set of topics. The
reasoning was
that they are producing keyed messages, they wanted to account for
growth,
and they wanted even distribution of the partitions to consumers as they
grow. 720 happens to have a lot of factors, so it was a good number for
them to pick.
As a note, we have up to 5000 partitions per broker right now on current
hardware, and we're moving to new hardware (more disk, 256 GB of memory,
10gig interfaces) where we're going to have up to 12,000. Our default
partition count for most clusters is 8, and we've got topics up to 512
partitions in some places just taking into account the produce rate
alone
(not counting those 720-partition topics that aren't that busy). Many of
our brokers run with over 10k open file handles for regular files alone,
and over 50k open when you include network.
-Todd
On Fri, Sep 4, 2015 at 8:11 AM, tao xiao <xiaotao...@gmail.com> wrote:
Here is a good doc to describe how to choose the right number of
partitions
http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
On Fri, Sep 4, 2015 at 10:08 PM, Jörg Wagner <joerg.wagn...@1und1.de>
wrote:
Hello!
Regarding the recommended amount of partitions I am a bit confused.
Basically I got the impression that it's better to have lots of
partitions
(see information from linkedin etc). On the other hand, a lot of
performance benchmarks floating around show only a few partitions are
being
used.
Especially when considering the difference between hdd and ssds and
also
the amount thereof, what is the way to go?
In my case, I seem to have the best stability and performance
issues with
few partitions *per hdd*, and only one io thread per disk.
What are your experiences and recommendations?
Cheers
Jörg
--
Regards,
Tao