RE: Cluster design distribution and JBOD vs RAID

Bello, Bob Wed, 16 Apr 2014 16:42:12 -0700

Perhaps as you consider the size of your cluster, a few questions about the 
kind of messaging you are looking at? I can use an example of what we do in our 
production environment while not going into specifics. These are just 
observations from an OPS perspective. (sorry for the wall of text.)


* Size of messages (<100 bytes, <1kB, <10kB, <100kB, <1MB, <10MB, etc). (we run 
messages size between a few byes to over 100KB with a few at over 1MB). 

* Volume of messages per second (we produce over 15k per second and can consume 
over 100K per second when we are processing though some lag)

* # of Producer clients (a few, a lot) (we have over 300 app servers the 
produce messages to the Kafka cluster) 
** Not only does this affect Kafka broker performance but it can use a lot of 
TCP connections specially if you run a large Kafka cluster

* # of Consumer clients (a few, a lot) (we have less than 50 app servers that 
consume at this time)
** This also affects the # of TCP connections to Kafka brokers. (We have over 
2400+ TCP connections to our cluster)

* Will you compress your message before sending them to Kafka? (we have a mix 
of snappy, gzip and non-compressed messages depending on the application). This 
can affect your disk usage

* Planned retention period. Longer retention period = more storage required. 
(we have varied retention periods per topic, between 10 days and 30 days).

* The number of topics per cluster. I believe Kafka scales well with the number 
of topics, however you have to worry about a few things:
** More topics, means slower migration/failover when Kafka brokers are shutdown 
or fail. This has caused us time out issues. Planned shutdown of a Kafka broker 
can take over 30 seconds to over 3 minutes. (We have over >10 and <50 topics. 
We are growing topics rapidly.)

* The number of partitions per topic. More partitions per topic = more open 
file handles, (2 per log file, one for data and one more the index). We run 
average of 130 partitions. You have to consider your cardinality for your 
messages if order is important. Can you use a key that allows a good 
distribution across partitions while maintaining order? If all your message end 
up in just a few partitions within the topic then it's harder scale the 
consumption. This all depends on your use case.

It might sound like good rationale to scale the # of partitions for a topic to 
a huge number (for just in case). I think it all depends.

* How many consumer threads can consume a single topic? You can't go wider than 
the # of partitions however Kafka clients easily work with a large # of 
partitions with a few consumer threads. 

* Producer vs. Consumer size. Is your messaging flow Producer or Consumer 
heavy. Kafka is awesome and sending data to consumers that use "recent" data. 
Since Kafka uses memory mapped files, any data from Kafka that is in RAM will 
be very fast. (Our servers have 256GB of ram on them).

* Size of your cluster vs. the # of replicas. Larger # of Kafka brokers means 
more chance of failure within the cluster. Same kind of reason why you 
generally won't see a large RAID5 array. You get one failure before you lose 
data. If you decide to run a large cluster and # of replicas will be important. 
How much risk are you willing to take? (We run a 6 node cluster with a replica 
factor of 3. We can lose a total of two nodes before losing data). 

* Are you running on native iron or virtualized? VM is generally lower 
performance but can generally spin up new instances faster upon failure. We run 
on native iron so we get excellent performance at the cost of longer lead times 
to provision new Kafka brokers. 

* Networking. Are you are running 100mbit, 1gig or 10gib? You can only produce 
and consume so much data. Larger clusters let you run a total aggregate 
bandwidth. Don't forget about replication! Topic/partition leaders must 
replicate to all replica Kafkabrokers (hub/spoke). How long can you wait for 
replication to occur after a planned or un-planned outage? (We run >1Gig).

* Monitoring. Large # of Kafka brokers means more to monitor. Do you have a 
centralized monitoring app? Kafka provides a lot (huge!) JMX information. 
Making sense of it all can take some time.

* Disk I/O. JBOD vs. RAID. How much are you willing to tolerate failures? Do 
you have provisioned IO? (We run native iron and local disk in a RAID 
configuration. It was easier for us to manage a single mount point than a bunch 
in a JBOD configuration. We rely of local RAID and Kafka replication to keep 
enough copies of our data. We have a large amount of disk capacity. We can 
tolerate large re-replication events due to broker failure without affecting 
producer or consumer performance.)

* Disk capacity / Kafka Broker capacity. Depending on your volume, message size 
and retention period, how much disk space will you need? (Using our "crystal 
ball tech(tm)" we decided over 20TB per Kafka broker would meet our needs. We 
will probably add Kafka brokers over adding disk as we outgrow this.)

* Separate clusters to keep information separated? Do you have a use case for 
keeping customer data separate? Compliance use cases such as PCI or SOX? This 
may be a good reason to keep separate Kafka clusters. I assume that you already 
will keep separate clusters for DEV/QA/PROD.

* Zookeeper performance - 3 node, 5 node or 7 node. Less nodes, better 
performance. More nodes, better failure tolerance. We run 5 nodes with the 
transaction logs on SSD. Our ZK update performance is very good.

# of partitions per Topic debate:
Personally, I'm a proponent of larger # of partitions per topic without going 
way large. You can add Kafka Brokers to increase capacity and get more 
performance. However though it's possible to add partitions after a topic is 
created, it can cause issues with your key hashing depending on your message 
architecture. 

* Increasing # of brokers = easy
* Increasing the # of partitions in a topic with data in it = hard

For us, we will be adding more topics and as we add additional messaging 
functionality.

Example:

130 partitions per topic / 6 brokers = 5 leader partitions per broker per 
topic. If you replicate 3 the you will end up with 3x active partitions per 
broker.

1024 partitions per topic / 24 brokers =~ 43 leader partitions per broker per 
topic. 


Final thoughts:

There's no magical formula for this as already stated in the wiki. It is a lot 
of trial and error. I will say that we went from a few 100 messages per second 
volume to over 40k per second by adding one application and our Kafka cluster 
didn't even blink. 

Kafka is awesome.

Btw, we're running 0.8.0.



- Bob

-----Original Message-----
From: bertc...@gmail.com [mailto:bertc...@gmail.com] On Behalf Of Bert Corderman
Sent: Wednesday, April 16, 2014 11:58 AM
To: users@kafka.apache.org
Subject: Cluster design distribution and JBOD vs RAID

I am wondering what others are doing in terms of cluster separation. (if at
all)  For example let’s say I need 24 nodes to support a given workload.
What are the tradeoffs between a single 24 node cluster vs 2 x 12 node
clusters for example.  The application I support can support separation of
data fairly easily as the data is all processed in the same way but can be
sharded isolated based on customers.  I understand the standard tradeoffs,
for example putting all your eggs in one basket but curious as if there are
any details specific to Kafka in terms of cluster scale out.



Somewhat related is the use of RAID vs JBOD, I have reviewed the documents
on the Kafka site and understand the tradeoff between space as well as
sequential IO vs random and the fact a RAID rebuild might kill the system.
I am specifically asking the question as it relates to larger cluster and
the impact on the number of partitions a topic might need.



Take an example of a 24 node cluster with 12 drives each the cluster would
have 288 drives.  To ensure a topic is distributed across all drives a
topic would require 288 partitions.  I am planning to test some of this but
wanted to know if there was a rule of thumb.  The following link
https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowdoIchoosethenumberofpartitionsforatopic?
Talks about supporting up to 10K partitions but its not clear if this is
for a cluster as a whole vs topic based


Those of you running larger clusters what are you doing?


Bert

RE: Cluster design distribution and JBOD vs RAID

Reply via email to