> BOB> We are using RAID10. It was a requirement from our Unix guys. The rationale was that we didn't want the loss of a single disk to force us to rebuild/re-replicate 20TB of data. We haven't experienced any drive failures that I am aware of. We have had complete server failures, but the data was still good. I believe we have 10 x 4TB drives in a RAID10 configuration. I/O performance is very good.
Just curious, would losing one disk in a JBOD setup really mean you'd have to re-replicate 20TB of data? If a single drive dies, wouldn't you only lose the partitions that happen to be on that drive?

On Apr 17, 2014, at 8:00 PM, Bello, Bob <bob.be...@dish.com> wrote:

> Some feedback from your feedback.

> BERT> We have several use cases we are looking at Kafka for. Today we are just using the file system to buffer data between our systems. We are looking at use cases that have varying message sizes of 200, 300, 1000, 2200 bytes.

> BOB> Since you are using small message sizes, watch out for large index files. You can stuff a lot of messages into the default log file size of 512MB. We use 1GB log files before rolling them.

> BERT> The use case we are looking at currently has hourly peaks of about 450K messages per second. For sizing we want to make sure we can support 900K. Our larger feed in terms of size peaks at 450MB/sec, so we want to make sure the cluster we build can support 900MB/sec.

> BOB> I believe LinkedIn has reported getting a throughput of 900k messages per second through a 6-node cluster. If you can achieve a flush rate of 100MB/s (which is easy for a good RAID setup), a 12-node cluster should be doable. Remember, when your topic/partition leadership is balanced across the cluster (preferred replica election) you get to take advantage of all the brokers. Don't forget to architect for failures. Can your cluster handle max throughput with two Kafka brokers in an offline state?

> BERT> Are you implying that the number of topics has a direct correlation to the failover time? I think I might test this by creating one topic, loading 500 million rows, and testing failover, then compare to 500 topics with 1 million rows each. Not sure if data in the queue impacts the failover, so I figured I would test that also.

> BOB> Yes, that is what we have seen. The current controller architecture takes longer for Kafka nodes to fail over. It's not the # of topics, but the # of topic/partitions that have to move over. When a Kafka broker fails (planned or unplanned), the producers and the consumers have to pause for all the topic/partition pairs whose leader was on the offline Kafka broker, and leadership has to move to another Kafka broker that is in ISR. With lots of topics/partitions (we have many thousands), it can take a bit. Remember, it's only a chunk, not all topics and partitions. This of course can change as the Kafka development team changes how this works. I highly recommend creating your topic and partition counts in DEV/QA and testing this out. You will see a difference.

> As for the amount of data in the topic/partition, that is of no concern for failover. The Kafka broker will only fail over those topics/partitions that are in ISR. Replication time once a Kafka broker is brought back online will depend on how far behind the Kafka broker is from the leader; this is the delta in the offsets. Planned shutdowns can be minutes; unplanned shutdowns/failures can take hours for our data to re-replicate.

> BERT> Our default config has 256GB of memory also. One thing I do want to test is the impact on the cluster of reading data not in memory. Have you done any testing like this?

> BOB> Yes, it's about putting enough data to flush outside the OS file cache. Put 512GB of data in your topics to make sure the data is not in the cache. Also, you can reset and/or use new consumer groups and make sure you read from the lowest offset. Watch your iostat output to see if you get lots of reads. On a normal Kafka cluster that is reading cached memory (for consumption), you will not see read I/O, assuming you don't have other processes on the system reading data (such as log aggregation). We see 30MB writes/flushes every other second with 1-2% I/O utilization.
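For what it's worth, here is a minimal sketch of that kind of cold-read test, assuming the 0.8-era high-level consumer API: a brand-new consumer group plus auto.offset.reset=smallest forces the brokers to serve the oldest segments from disk while you watch iostat. The ZooKeeper connect string, group name and topic are placeholders.

    // Sketch: force reads of old (likely uncached) data by consuming from the
    // earliest offset under a brand-new consumer group. Assumes the 0.8
    // high-level consumer; "cold-read-test" and "mytopic" are placeholders.
    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class ColdReadTest {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181");
            props.put("group.id", "cold-read-test-" + System.currentTimeMillis()); // fresh group => no committed offsets
            props.put("auto.offset.reset", "smallest");                            // start from the oldest data on disk
            ConsumerConnector connector =
                    Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

            Map<String, Integer> topicCount = Collections.singletonMap("mytopic", 1);
            KafkaStream<byte[], byte[]> stream =
                    connector.createMessageStreams(topicCount).get("mytopic").get(0);

            long n = 0;
            ConsumerIterator<byte[], byte[]> it = stream.iterator();
            while (it.hasNext()) {      // drain while watching iostat for read I/O on the brokers
                it.next();
                if (++n % 1000000 == 0) System.out.println("consumed " + n + " messages");
            }
        }
    }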
> BERT> We have not determined what to use just yet for monitoring. What are you guys using?

> BOB> We are using a commercial APM solution. It's a Java agent that plugs into the JVM at boot time. It reads the JMX information as well as file I/O rates, NIO rates and GC, and sends it to a centralized monitoring console. Google "Java APM" for some ideas.

> BERT> Can you share more about your config? Are you using RAID10 or RAID5? What size and speed of drives? Have you needed to do a RAID rebuild, and if so did it negatively impact the cluster? The standard server I was given has 12 x 4TB 7.2K drives. I will either run in JBOD or as RAID10. Parity-based RAID with 4TB drives makes me nervous. I am not worried about performance when things are working as designed... we need to plan for edge cases when a consumer is reading old data or the system needs to play catch-up on a big backlog.

> BOB> We are using RAID10. It was a requirement from our Unix guys. The rationale was that we didn't want the loss of a single disk to force us to rebuild/re-replicate 20TB of data. We haven't experienced any drive failures that I am aware of. We have had complete server failures, but the data was still good. I believe we have 10 x 4TB drives in a RAID10 configuration. I/O performance is very good.

> BERT> Need to spend some time on ZooKeeper. I have not looked at ZooKeeper performance to see if it's negatively impacting the performance tests I am doing. We haven't spent any time looking at ZooKeeper. Did you find that the SSD helped improve Kafka performance?

> BOB> We started with SSD. The Kafka brokers themselves don't write a lot of data to ZooKeeper very frequently. It's really about how your consumers flush their offsets. This is assuming you will be using the high-level consumer client. If you are going to flush the offsets to ZooKeeper on every message consumed (to get best-effort, nearly-exactly-once processing), you will be writing a lot of data to ZooKeeper. On our 5-node ZooKeeper cluster, we are doing 300+ writes per second, and can spike up to many 1000's. Typically it's a 1-2MB/s data rate. The SSDs are under 2% I/O utilization. 200MB of ZK data, and we clean up the files once per hour. We run some consumers in batch and flush on a time delay. Other consumers flush per message processed. It's the flush per message that causes the high volume.

> Push back on DEVs and software architecture if they want to flush per message. Do it only where it's absolutely necessary. :)

> The high-level Kafka consumer is good at "at least once" processing. Exactly once is a harder nut to crack. Exactly-once processing may require some custom code around the low-level Kafka consumer client.

> - Bob
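To make the offset-flushing point concrete, here is a rough sketch (not production code) of committing offsets every N messages instead of per message, assuming the 0.8-era high-level consumer. The group, topic, batch size and process() handler are all illustrative; semantics stay at-least-once, and a crash re-delivers at most one uncommitted batch.

    // Sketch: commit offsets to ZooKeeper once per batch instead of per message,
    // to keep the ZK write rate down. Assumes the 0.8 high-level consumer;
    // names are placeholders and process() is a stand-in for the real handler.
    import java.util.Collections;
    import java.util.Properties;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class BatchedOffsetCommit {
        static void process(byte[] message) { /* real work goes here */ }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181");
            props.put("group.id", "order-processor");
            props.put("auto.commit.enable", "false"); // we commit manually below

            ConsumerConnector connector =
                    Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
            KafkaStream<byte[], byte[]> stream =
                    connector.createMessageStreams(Collections.singletonMap("orders", 1))
                             .get("orders").get(0);

            final int commitEvery = 1000; // bigger batches = fewer ZK writes, more re-delivery after a crash
            long sinceCommit = 0;
            ConsumerIterator<byte[], byte[]> it = stream.iterator();
            while (it.hasNext()) {
                process(it.next().message());
                if (++sinceCommit >= commitEvery) {
                    connector.commitOffsets(); // one ZK write per owned partition per batch
                    sinceCommit = 0;
                }
            }
        }
    }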
> -----Original Message-----
> From: bertc...@gmail.com [mailto:bertc...@gmail.com] On Behalf Of Bert Corderman
> Sent: Thursday, April 17, 2014 7:21 AM
> To: users@kafka.apache.org
> Subject: Re: Cluster design distribution and JBOD vs RAID

> Hey Bob,

> thanks for your detailed response. I have added comments inline.

> On Wed, Apr 16, 2014 at 7:41 PM, Bello, Bob <bob.be...@dish.com> wrote:

>> Perhaps as you consider the size of your cluster, a few questions about the kind of messaging you are looking at? I can use an example of what we do in our production environment while not going into specifics. These are just observations from an OPS perspective. (Sorry for the wall of text.)

>> * Size of messages (<100 bytes, <1kB, <10kB, <100kB, <1MB, <10MB, etc). (We run message sizes between a few bytes and over 100KB, with a few over 1MB.)

> BERT> We have several use cases we are looking at Kafka for. Today we are just using the file system to buffer data between our systems. We are looking at use cases that have varying message sizes of 200, 300, 1000, 2200 bytes.

>> * Volume of messages per second (we produce over 15k per second and can consume over 100K per second when we are processing through some lag)

> BERT> The use case we are looking at currently has hourly peaks of about 450K messages per second. For sizing we want to make sure we can support 900K. Our larger feed in terms of size peaks at 450MB/sec, so we want to make sure the cluster we build can support 900MB/sec.

>> * # of Producer clients (a few, a lot). (We have over 300 app servers that produce messages to the Kafka cluster.)
>> ** Not only does this affect Kafka broker performance, but it can use a lot of TCP connections, especially if you run a large Kafka cluster.

> BERT> Our producer count will be low... maybe 8-16 hosts.

>> * # of Consumer clients (a few, a lot). (We have fewer than 50 app servers that consume at this time.)
>> ** This also affects the # of TCP connections to Kafka brokers. (We have over 2400+ TCP connections to our cluster.)

> BERT> This will be much higher but not sure yet. We are also looking at replacing some legacy technology with Storm, so this is a bit up in the air right now.

>> * Will you compress your messages before sending them to Kafka? (We have a mix of snappy, gzip and non-compressed messages depending on the application.) This can affect your disk usage.

> BERT> We will use whatever performs best ;) My gut is that we will be using snappy.

>> * Planned retention period. Longer retention period = more storage required. (We have varied retention periods per topic, between 10 days and 30 days.)

>> * The number of topics per cluster. I believe Kafka scales well with the number of topics, however you have to worry about a few things:
>> ** More topics means slower migration/failover when Kafka brokers are shut down or fail. This has caused us timeout issues. Planned shutdown of a Kafka broker can take over 30 seconds to over 3 minutes. (We have >10 and <50 topics. We are growing topics rapidly.)

> BERT> Are you implying that the number of topics has a direct correlation to the failover time? I think I might test this by creating one topic, loading 500 million rows, and testing failover, then compare to 500 topics with 1 million rows each. Not sure if data in the queue impacts the failover, so I figured I would test that also.

>> * The number of partitions per topic. More partitions per topic = more open file handles (2 per log segment: one for the data and one for the index). We run an average of 130 partitions. You have to consider the cardinality of your messages if order is important. Can you use a key that allows a good distribution across partitions while maintaining order? If all your messages end up in just a few partitions within the topic, then it's harder to scale the consumption. This all depends on your use case.

> BERT> We are lucky that order is not critical for our large feeds.
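Tying together the keying and compression points above, a bare-bones 0.8-era producer sketch might look like this. The broker list, topic, key and payload are placeholders; with the default partitioner, the key hash decides the partition, so ordering is preserved per key while keys spread across partitions.

    // Sketch: an 0.8-era producer that keys messages (per-key ordering, keys
    // hashed across partitions) and compresses with snappy. Names are placeholders.
    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class KeyedSnappyProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092,broker2:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            props.put("compression.codec", "snappy");  // compress on the wire and on disk
            props.put("request.required.acks", "-1");  // wait for ISR acks; trades latency for durability

            Producer<String, String> producer =
                    new Producer<String, String>(new ProducerConfig(props));
            // Key by customer id: all events for one customer land in one partition
            // (ordered), while different customers hash across partitions.
            producer.send(new KeyedMessage<String, String>("orders", "customer-42", "{...event payload...}"));
            producer.close();
        }
    }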
>> It might sound like good rationale to scale the # of partitions for a topic to a huge number (just in case). I think it all depends.

>> * How many consumer threads can consume a single topic? You can't go wider than the # of partitions; however, Kafka clients easily work with a large # of partitions and a few consumer threads.

>> * Producer vs. consumer size. Is your messaging flow producer- or consumer-heavy? Kafka is awesome at sending data to consumers that use "recent" data. Since Kafka uses memory-mapped files, any data from Kafka that is in RAM will be very fast. (Our servers have 256GB of RAM on them.)

> BERT> Our default config has 256GB of memory also. One thing I do want to test is the impact on the cluster of reading data not in memory. Have you done any testing like this?

>> * Size of your cluster vs. the # of replicas. A larger # of Kafka brokers means more chance of failure within the cluster. Same kind of reason why you generally won't see a large RAID5 array: you get one failure before you lose data. If you decide to run a large cluster, the # of replicas will be important. How much risk are you willing to take? (We run a 6-node cluster with a replication factor of 3. We can lose a total of two nodes before losing data.)

> BERT> Thanks for the datapoint. We were also planning to go with a replication factor of 3.

>> * Are you running on native iron or virtualized? VM is generally lower performance but can generally spin up new instances faster upon failure. We run on native iron, so we get excellent performance at the cost of longer lead times to provision new Kafka brokers.

> BERT> We are big fans of VMs... however, Kafka will be on physical hardware.

>> * Networking. Are you running 100Mbit, 1Gig or 10Gig? You can only produce and consume so much data. Larger clusters give you more total aggregate bandwidth. Don't forget about replication! Topic/partition leaders must replicate to all replica Kafka brokers (hub/spoke). How long can you wait for replication to occur after a planned or unplanned outage? (We run >1Gig.)

> BERT> 10Gig... so cheap now. I did a cost analysis and found that a single 10Gig port costs about the same as 2 x 1Gig. Five times the bandwidth and less latency makes it a no-brainer. If your Kafka hosts have multiple NICs, make sure they are using the right port. This one bit me for a bit (the hostname setting in the broker config).

>> * Monitoring. A large # of Kafka brokers means more to monitor. Do you have a centralized monitoring app? Kafka provides a lot (huge!) of JMX information. Making sense of it all can take some time.

> BERT> We have not determined what to use just yet for monitoring. What are you guys using?
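As a starting point for exploring that JMX data without a commercial APM, a bare-bones poller might look like the sketch below. The JMX port is whatever you set on the broker JVM, and the exact Kafka MBean and attribute names vary by version, so treat them as illustrative and check with jconsole first.

    // Sketch: connect to a broker's JMX port and list the Kafka MBeans registered
    // there. The service URL is an assumption (-Dcom.sun.management.jmxremote.port=9999
    // on the broker); from the printed names you can pick one and read attributes
    // such as a one-minute rate with getAttribute().
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class BrokerJmxList {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection conn = connector.getMBeanServerConnection();
                for (ObjectName name : conn.queryNames(null, null)) {
                    if (name.toString().contains("kafka")) {
                        System.out.println(name); // candidate metrics to poll/graph
                    }
                }
            } finally {
                connector.close();
            }
        }
    }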
>> * Disk I/O. JBOD vs. RAID. How many failures are you willing to tolerate? Do you have provisioned I/O? (We run native iron and local disk in a RAID configuration. It was easier for us to manage a single mount point than a bunch in a JBOD configuration. We rely on local RAID and Kafka replication to keep enough copies of our data. We have a large amount of disk capacity. We can tolerate large re-replication events due to broker failure without affecting producer or consumer performance.)

> BERT> Can you share more about your config? Are you using RAID10 or RAID5? What size and speed of drives? Have you needed to do a RAID rebuild, and if so did it negatively impact the cluster? The standard server I was given has 12 x 4TB 7.2K drives. I will either run in JBOD or as RAID10. Parity-based RAID with 4TB drives makes me nervous. I am not worried about performance when things are working as designed... we need to plan for edge cases when a consumer is reading old data or the system needs to play catch-up on a big backlog.

>> * Disk capacity / Kafka broker capacity. Depending on your volume, message size and retention period, how much disk space will you need? (Using our "crystal ball tech(tm)" we decided over 20TB per Kafka broker would meet our needs. We will probably add Kafka brokers rather than adding disk as we outgrow this.)

> BERT> I need a crystal ball ;)

>> * Separate clusters to keep information separated? Do you have a use case for keeping customer data separate? Compliance use cases such as PCI or SOX? This may be a good reason to keep separate Kafka clusters. I assume you will already keep separate clusters for DEV/QA/PROD.

> BERT> Yes, DEV/QA/PROD completely separate.

>> * ZooKeeper performance - 3-node, 5-node or 7-node. Fewer nodes, better performance. More nodes, better failure tolerance. We run 5 nodes with the transaction logs on SSD. Our ZK update performance is very good.

> BERT> Need to spend some time on ZooKeeper. I have not looked at ZooKeeper performance to see if it's negatively impacting the performance tests I am doing. We haven't spent any time looking at ZooKeeper. Did you find that the SSD helped improve Kafka performance?

>> # of partitions per topic debate:
>> Personally, I'm a proponent of a larger # of partitions per topic without going way large. You can add Kafka brokers to increase capacity and get more performance. However, though it's possible to add partitions after a topic is created, it can cause issues with your key hashing depending on your message architecture.

>> * Increasing the # of brokers = easy
>> * Increasing the # of partitions in a topic with data in it = hard

>> For us, we will be adding more topics as we add additional messaging functionality.

>> Example:

>> 130 partitions per topic / 6 brokers = ~22 leader partitions per broker per topic. If you replicate by 3, you will end up with 3x that many active partitions per broker.

>> 1024 partitions per topic / 24 brokers =~ 43 leader partitions per broker per topic.

> BERT> Thanks for the example. Good to see others are using larger partition counts.
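The per-broker arithmetic above generalizes to a one-liner; here is a tiny illustrative helper (plain Java, no Kafka APIs) for sanity-checking partition counts against broker counts and replication factor. The numbers in main() are just the two examples from the thread.

    // Sketch: rough partition-per-broker math for capacity planning.
    // Purely illustrative arithmetic.
    public class PartitionMath {
        static void show(int partitions, int brokers, int replicationFactor) {
            double leadersPerBroker = (double) partitions / brokers;
            double hostedPerBroker = leadersPerBroker * replicationFactor; // leader + follower replicas
            System.out.printf("%d partitions on %d brokers (RF=%d): ~%.0f leaders, ~%.0f replicas per broker%n",
                    partitions, brokers, replicationFactor, leadersPerBroker, hostedPerBroker);
        }

        public static void main(String[] args) {
            show(130, 6, 3);    // ~22 leaders, ~65 replicas per broker per topic
            show(1024, 24, 3);  // ~43 leaders, ~128 replicas per broker per topic
        }
    }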
>> Final thoughts:

>> There's no magical formula for this, as already stated in the wiki. It is a lot of trial and error. I will say that we went from a volume of a few hundred messages per second to over 40k per second by adding one application, and our Kafka cluster didn't even blink.

>> Kafka is awesome.

>> Btw, we're running 0.8.0.

>> - Bob

>> -----Original Message-----
>> From: bertc...@gmail.com [mailto:bertc...@gmail.com] On Behalf Of Bert Corderman
>> Sent: Wednesday, April 16, 2014 11:58 AM
>> To: users@kafka.apache.org
>> Subject: Cluster design distribution and JBOD vs RAID

>> I am wondering what others are doing in terms of cluster separation (if at all). For example, let's say I need 24 nodes to support a given workload. What are the tradeoffs between a single 24-node cluster vs. 2 x 12-node clusters? The application I support can handle separation of data fairly easily, as the data is all processed in the same way but can be sharded/isolated based on customers. I understand the standard tradeoffs, for example putting all your eggs in one basket, but I'm curious whether there are any details specific to Kafka in terms of cluster scale-out.

>> Somewhat related is the use of RAID vs. JBOD. I have reviewed the documents on the Kafka site and understand the tradeoffs in space, sequential vs. random I/O, and the fact that a RAID rebuild might kill the system. I am specifically asking the question as it relates to larger clusters and the impact on the number of partitions a topic might need.

>> Take the example of a 24-node cluster with 12 drives each; the cluster would have 288 drives. To ensure a topic is distributed across all drives, the topic would require 288 partitions. I am planning to test some of this but wanted to know if there was a rule of thumb. The following link
>> https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowdoIchoosethenumberofpartitionsforatopic?
>> talks about supporting up to 10K partitions, but it's not clear if this is for the cluster as a whole or per topic.

>> Those of you running larger clusters, what are you doing?

>> Bert