Thanks a lot, guys, for your answers. We'll be running some benchmarks comparing different numbers of partitions under the same load, and we'll share the results.

Cheers
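As a sketch of such a benchmark: send an identical load into two pre-created topics that differ only in partition count, and compare throughput. This is only an illustration, assuming the Java producer client; the broker address, topic names, message size, and message count are all placeholders, not settings from this cluster.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitionBench {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        byte[] payload = new byte[100]; // ~100-byte messages
        int n = 1_000_000;

        // e.g. "bench-128" created with 128 partitions, "bench-512" with 512
        for (String topic : new String[] { "bench-128", "bench-512" }) {
            KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props);
            long start = System.currentTimeMillis();
            for (int i = 0; i < n; i++) {
                producer.send(new ProducerRecord<>(topic, payload));
            }
            producer.close(); // blocks until all buffered records are sent
            long elapsed = System.currentTimeMillis() - start;
            System.out.printf("%s: %.0f msgs/sec%n", topic, n * 1000.0 / elapsed);
        }
    }
}

While this runs, broker CPU can be watched on each of the two topics to see whether the partition count alone accounts for the difference.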
> On 21 May 2015, at 06:04, Saladi Naidu <naidusp2...@yahoo.com.INVALID> wrote:
>
> In general, partitions improve throughput through parallelism. From your
> explanation below, yes, partitions are written to different physical
> locations, but writes are still append-only. With write-ahead buffering and
> append-only writes, having partitions will still increase throughput.
> Below is an excellent article by Jun Rao about partitions and their impact:
>
> How to choose the number of topics/partitions in a Kafka cluster?
> "This is a common question asked by many Kafka users. The goal of this post
> is to explain a few important determining factors and provide a few simple
> formulas." (On blog.confluent.io)
>
> Naidu Saladi
>
> From: Daniel Compton <daniel.compton.li...@gmail.com>
> To: users@kafka.apache.org
> Sent: Wednesday, May 20, 2015 8:21 PM
> Subject: Re: Optimal number of partitions for topic
>
> One of the beautiful things about Kafka is that it uses the disk and OS
> disk caching really efficiently.
>
> Because Kafka writes messages to a contiguous log, it needs very little
> seek time to move the write head to the next point. Similarly for reading:
> if the consumers are mostly up to date with the topic, the disk head will
> be close to the reading point, or the data will already be in the disk
> cache and can be read from memory. Even cooler, because disk access is so
> predictable, the OS can easily prefetch data, since it knows what you're
> going to ask for in 50 ms.
>
> In summary, data locality rocks.
>
> What I haven't worked out is how the data-locality benefits interact with
> having multiple partitions. I would assume that they would make things
> slower, because you will be writing to multiple physical locations on
> disk, though this may be ameliorated by only fsyncing every n seconds.
>
> I'd be keen to hear how this impacts things, or even better, to see some
> benchmarks.
>
> --
> Daniel.
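For reference, the "fsyncing every n seconds" Daniel mentions is governed by the broker flush settings. By default Kafka leaves flushing to the OS page cache, which is what keeps append-only writes cheap. The snippet below shows the relevant knobs in the same format as the config further down; the values are purely illustrative, not a tuning recommendation.

# Flush after this many messages, or after this interval, whichever comes
# first. If left unset, Kafka defers to the OS page-cache flush.
# Values here are illustrative only.
log.flush.interval.messages=10000
log.flush.interval.ms=1000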
>
> On Thu, 21 May 2015 at 12:56 pm Manoj Khangaonkar <khangaon...@gmail.com>
> wrote:
>
>> Without knowing the actual implementation details, I would guess that more
>> partitions imply more parallelism, more concurrency, more threads, and more
>> files to write to - all of which will contribute to more CPU load.
>>
>> Partitions allow you to scale by spreading the topic across multiple
>> brokers. The partition is also the unit of replication (1 leader +
>> replicas). And for consumption of messages, order is maintained within a
>> partition.
>>
>> But if you put 100 partitions per topic on a single broker, I wonder
>> whether that is going to be an overhead.
>>
>> On Wed, May 20, 2015 at 1:02 AM, Carles Sistare <car...@ogury.co> wrote:
>>
>>> Hi,
>>> We are implementing a Kafka cluster with 9 brokers on EC2 instances, and
>>> we are trying to find the optimal number of partitions for our topics:
>>> the maximum number we can set, so that we never have to change the
>>> partition count again.
>>> What we understood is that the number of partitions shouldn't affect the
>>> CPU load of the brokers, but when we create 512 partitions instead of
>>> 128, for instance, the CPU load explodes.
>>> We have three topics with 100,000 messages/sec each, a replication
>>> factor of 3, and two consumer groups on each topic.
>>>
>>> Could somebody explain why the increase in the number of partitions has
>>> such a dramatic impact on the CPU load?
>>>
>>> Hereunder I paste the Kafka config file:
>>>
>>> broker.id=3
>>>
>>> default.replication.factor=3
>>>
>>> # The port the socket server listens on
>>> port=9092
>>>
>>> # The number of threads handling network requests
>>> num.network.threads=2
>>>
>>> # The number of threads doing disk I/O
>>> num.io.threads=8
>>>
>>> # The send buffer (SO_SNDBUF) used by the socket server
>>> socket.send.buffer.bytes=1048576
>>>
>>> # The receive buffer (SO_RCVBUF) used by the socket server
>>> socket.receive.buffer.bytes=1048576
>>>
>>> # The maximum size of a request that the socket server will accept
>>> # (protection against OOM)
>>> socket.request.max.bytes=104857600
>>>
>>> # A comma-separated list of directories under which to store log files
>>> log.dirs=/mnt/kafka-logs
>>>
>>> # The default number of log partitions per topic. More partitions allow
>>> # greater parallelism for consumption, but also result in more files
>>> # across the brokers.
>>> num.partitions=16
>>>
>>> # The minimum age of a log file to be eligible for deletion
>>> log.retention.hours=1
>>>
>>> # The maximum size of a log segment file. When this size is reached, a
>>> # new log segment will be created.
>>> log.segment.bytes=536870912
>>>
>>> # The interval at which log segments are checked to see if they can be
>>> # deleted according to the retention policies
>>> log.retention.check.interval.ms=60000
>>>
>>> # By default the log cleaner is disabled and the log retention policy
>>> # will default to just deleting segments after their retention expires.
>>> # If log.cleaner.enable=true is set, the cleaner will be enabled and
>>> # individual logs can then be marked for log compaction.
>>> log.cleaner.enable=false
>>>
>>> # Timeout in ms for connecting to ZooKeeper
>>> zookeeper.connection.timeout.ms=1000000
>>>
>>> auto.leader.rebalance.enable=true
>>> controlled.shutdown.enable=true
>>>
>>> Thanks in advance.
>>>
>>> Carles Sistare
>>
>>
>> --
>> http://khangaonkar.blogspot.com/
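To make Manoj's point about ordering concrete: Kafka guarantees order only within a partition, and the default partitioner sends records with the same key to the same partition, so per-key ordering survives no matter how many partitions the topic has. A minimal sketch using the Java producer; the topic name "events", the key "user-42", and the broker address are hypothetical.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedOrdering {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // Same key => same partition, so these three events are consumed
        // in exactly this order, regardless of the topic's partition count.
        producer.send(new ProducerRecord<>("events", "user-42", "login"));
        producer.send(new ProducerRecord<>("events", "user-42", "click"));
        producer.send(new ProducerRecord<>("events", "user-42", "logout"));

        producer.close(); // flushes outstanding records before returning
    }
}

This is also why adding partitions later can be disruptive for keyed topics: the key-to-partition mapping changes when the partition count does, so picking a workable count up front, as discussed above, matters beyond CPU load.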