Thanks a lot, guys, for your answers. We'll be running some benchmarks comparing different numbers of partitions under the same load, and we'll share the results.

Cheers
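As a sketch of such a benchmark: send an identical load into two pre-created topics that differ only in partition count, and compare throughput. This is only an illustration, assuming the Java producer client; the broker address, topic names, message size, and message count are all placeholders, not settings from this cluster.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitionBench {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        byte[] payload = new byte[100]; // ~100-byte messages
        int n = 1_000_000;

        // e.g. "bench-128" created with 128 partitions, "bench-512" with 512
        for (String topic : new String[] { "bench-128", "bench-512" }) {
            KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props);
            long start = System.currentTimeMillis();
            for (int i = 0; i < n; i++) {
                producer.send(new ProducerRecord<>(topic, payload));
            }
            producer.close(); // blocks until all buffered records are sent
            long elapsed = System.currentTimeMillis() - start;
            System.out.printf("%s: %.0f msgs/sec%n", topic, n * 1000.0 / elapsed);
        }
    }
}

While this runs, broker CPU can be watched on each of the two topics to see whether the partition count alone accounts for the difference.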
> On 21 May 2015, at 06:04, Saladi Naidu <naidusp2...@yahoo.com.INVALID> wrote:
>
> In general, partitions improve throughput through parallelism. From your
> explanation below, yes, partitions are written to different physical
> locations, but writes are still append-only. With write-ahead buffering and
> append-only writes, having partitions will still increase throughput.
> Below is an excellent article by Jun Rao about partitions and their impact:
>
> How to choose the number of topics/partitions in a Kafka cluster?
> "This is a common question asked by many Kafka users. The goal of this post
> is to explain a few important determining factors and provide a few simple
> formulas." (On blog.confluent.io)
>
> Naidu Saladi
>
> From: Daniel Compton <daniel.compton.li...@gmail.com>
> To: users@kafka.apache.org
> Sent: Wednesday, May 20, 2015 8:21 PM
> Subject: Re: Optimal number of partitions for topic
>
> One of the beautiful things about Kafka is that it uses the disk and OS
> disk caching really efficiently.
>
> Because Kafka writes messages to a contiguous log, it needs very little
> seek time to move the write head to the next point. Similarly for reading:
> if the consumers are mostly up to date with the topic, the disk head will
> be close to the reading point, or the data will already be in the disk
> cache and can be read from memory. Even cooler, because disk access is so
> predictable, the OS can easily prefetch data, since it knows what you're
> going to ask for in 50 ms.
>
> In summary, data locality rocks.
>
> What I haven't worked out is how the data-locality benefits interact with
> having multiple partitions. I would assume that they would make things
> slower, because you will be writing to multiple physical locations on
> disk, though this may be ameliorated by only fsyncing every n seconds.
>
> I'd be keen to hear how this impacts things, or even better, to see some
> benchmarks.
>
> --
> Daniel.
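For reference, the "fsyncing every n seconds" Daniel mentions is governed by the broker flush settings. By default Kafka leaves flushing to the OS page cache, which is what keeps append-only writes cheap. The snippet below shows the relevant knobs in the same format as the config further down; the values are purely illustrative, not a tuning recommendation.

# Flush after this many messages, or after this interval, whichever comes
# first. If left unset, Kafka defers to the OS page-cache flush.
# Values here are illustrative only.
log.flush.interval.messages=10000
log.flush.interval.ms=1000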
>
> On Thu, 21 May 2015 at 12:56 pm Manoj Khangaonkar <khangaon...@gmail.com>
> wrote:
>
>> Without knowing the actual implementation details, I would guess that more
>> partitions imply more parallelism, more concurrency, more threads, and more
>> files to write to - all of which will contribute to more CPU load.
>>
>> Partitions allow you to scale by spreading the topic across multiple
>> brokers. The partition is also the unit of replication (1 leader +
>> replicas). And for consumption of messages, order is maintained within a
>> partition.
>>
>> But if you put 100 partitions per topic on a single broker, I wonder
>> whether that is going to be an overhead.
>>
>> On Wed, May 20, 2015 at 1:02 AM, Carles Sistare <car...@ogury.co> wrote:
>>
>>> Hi,
>>> We are implementing a Kafka cluster with 9 brokers on EC2 instances, and
>>> we are trying to find the optimal number of partitions for our topics:
>>> the maximum number we can set, so that we never have to change the
>>> partition count again.
>>> What we understood is that the number of partitions shouldn't affect the
>>> CPU load of the brokers, but when we create 512 partitions instead of
>>> 128, for instance, the CPU load explodes.
>>> We have three topics with 100,000 messages/sec each, a replication
>>> factor of 3, and two consumer groups on each topic.
>>>
>>> Could somebody explain why the increase in the number of partitions has
>>> such a dramatic impact on the CPU load?
>>>
>>> Hereunder I paste the Kafka config file:
>>>
>>> broker.id=3
>>>
>>> default.replication.factor=3
>>>
>>> # The port the socket server listens on
>>> port=9092
>>>
>>> # The number of threads handling network requests
>>> num.network.threads=2
>>>
>>> # The number of threads doing disk I/O
>>> num.io.threads=8
>>>
>>> # The send buffer (SO_SNDBUF) used by the socket server
>>> socket.send.buffer.bytes=1048576
>>>
>>> # The receive buffer (SO_RCVBUF) used by the socket server
>>> socket.receive.buffer.bytes=1048576
>>>
>>> # The maximum size of a request that the socket server will accept
>>> # (protection against OOM)
>>> socket.request.max.bytes=104857600
>>>
>>> # A comma-separated list of directories under which to store log files
>>> log.dirs=/mnt/kafka-logs
>>>
>>> # The default number of log partitions per topic. More partitions allow
>>> # greater parallelism for consumption, but also result in more files
>>> # across the brokers.
>>> num.partitions=16
>>>
>>> # The minimum age of a log file to be eligible for deletion
>>> log.retention.hours=1
>>>
>>> # The maximum size of a log segment file. When this size is reached, a
>>> # new log segment will be created.
>>> log.segment.bytes=536870912
>>>
>>> # The interval at which log segments are checked to see if they can be
>>> # deleted according to the retention policies
>>> log.retention.check.interval.ms=60000
>>>
>>> # By default the log cleaner is disabled and the log retention policy
>>> # will default to just deleting segments after their retention expires.
>>> # If log.cleaner.enable=true is set, the cleaner will be enabled and
>>> # individual logs can then be marked for log compaction.
>>> log.cleaner.enable=false
>>>
>>> # Timeout in ms for connecting to ZooKeeper
>>> zookeeper.connection.timeout.ms=1000000
>>>
>>> auto.leader.rebalance.enable=true
>>> controlled.shutdown.enable=true
>>>
>>> Thanks in advance.
>>>
>>> Carles Sistare
>>
>>
>> --
>> http://khangaonkar.blogspot.com/
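To make Manoj's point about ordering concrete: Kafka guarantees order only within a partition, and the default partitioner sends records with the same key to the same partition, so per-key ordering survives no matter how many partitions the topic has. A minimal sketch using the Java producer; the topic name "events", the key "user-42", and the broker address are hypothetical.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedOrdering {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // Same key => same partition, so these three events are consumed
        // in exactly this order, regardless of the topic's partition count.
        producer.send(new ProducerRecord<>("events", "user-42", "login"));
        producer.send(new ProducerRecord<>("events", "user-42", "click"));
        producer.send(new ProducerRecord<>("events", "user-42", "logout"));

        producer.close(); // flushes outstanding records before returning
    }
}

This is also why adding partitions later can be disruptive for keyed topics: the key-to-partition mapping changes when the partition count does, so picking a workable count up front, as discussed above, matters beyond CPU load.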