Hey Grant,

Yes, this KIP does exactly what you described :)
Thanks,
Dong

On Wed, Feb 1, 2017 at 9:45 AM, Grant Henke <ghe...@cloudera.com> wrote:
> Hi Dong,
> Thanks for putting this together.
> Since we are discussing alternative/simplified options, have you considered handling the disk failures broker-side to prevent a crash, marking the disk as "bad" to that individual broker, and continuing as normal? I imagine the broker would then fall out of sync for the replicas hosted on the bad disk and the ISR would shrink. This would allow people using min.isr to keep their data safe, and the cluster operators would see a shrink in many ISRs and hopefully an obvious log message leading to a quick fix. I haven't thought through this idea in depth though, so there could be some shortfalls.
> Thanks,
> Grant
> On Wed, Feb 1, 2017 at 11:21 AM, Dong Lin <lindon...@gmail.com> wrote:
> > Hey Eno,
> > Thanks much for the review.
> > I think your suggestion is to split the disks of a machine into multiple disk sets and run one broker per disk set. Yeah, this is similar to Colin's suggestion of one-broker-per-disk, which we have evaluated at LinkedIn and consider a good short-term approach.
> > As of now I don't think any of these approaches is a better alternative in the long term. I will summarize them here. I have put these reasons in the KIP's motivation section and rejected alternatives section. I am happy to discuss more, and I would certainly like to use an alternative solution that is easier to implement and has better performance.
> > - JBOD vs. RAID-10: if we switch from RAID-10 with replication-factor=2 to JBOD with replication-factor=3, we get a 25% reduction in disk usage and double the number of broker failures tolerated before data unavailability, from 1 to 2. This is a pretty huge gain for any company that uses Kafka at large scale.
> > - JBOD vs. one-broker-per-disk: The benefit of one-broker-per-disk is that no major code change is needed in Kafka. Among the disadvantages of one-broker-per-disk summarized in the KIP and the previous email with Colin, the biggest ones are the 15% throughput loss compared to JBOD and the reduced flexibility to balance load across disks. Further, it probably requires changes to internal deployment tools at various companies to deal with the one-broker-per-disk setup.
> > - JBOD vs. RAID-0: This is the setup used at Microsoft. The problem is that a broker becomes unavailable if any disk fails. Suppose replication-factor=2 and there are 10 disks per machine. Then the probability that any given message becomes unavailable due to disk failure is 100X higher with RAID-0 than with JBOD (see the sketch at the end of this thread).
> > - JBOD vs. one-broker-per-few-disks: one-broker-per-few-disks is somewhere between one-broker-per-disk and RAID-0, so it carries an average of the disadvantages of those two approaches.
> > To answer your question: I think it is reasonable to manage disks in Kafka. By "managing disks" we mean managing the assignment of replicas across disks. Here are my reasons in more detail:
> > - I don't think this KIP is a big step change. By allowing users to configure Kafka to run with multiple log directories or disks, as it does today, it is already implicit that Kafka manages disks. It is just not a complete feature. Microsoft and probably other companies are using this feature despite the undesirable effect that a broker will fail if any disk fails. It is good to complete this feature.
> > - I think it is reasonable to manage disks in Kafka. One of the most important pieces of work that Kafka does is to determine the replica assignment across brokers and make sure enough copies of a given replica are available. I would argue that determining the replica assignment across disks is conceptually not much different.
> > - I would agree that this KIP improves the performance of Kafka at the cost of more complexity inside Kafka, by switching from RAID-10 to JBOD. I would argue that this is the right direction. If we could gain 20%+ performance by managing NICs in Kafka as compared to the existing approach and other alternatives, I would say we should just do it. Such a gain in performance, or equivalently reduction in cost, can save millions of dollars per year for any company running Kafka at large scale.
> > Thanks,
> > Dong
> > On Wed, Feb 1, 2017 at 5:41 AM, Eno Thereska <eno.there...@gmail.com> wrote:
> > > I'm coming somewhat late to the discussion, apologies for that.
> > > I'm worried about this proposal. It's moving Kafka to a world where it manages disks. So in a sense, the scope of the KIP is limited, but the direction it sets for Kafka is quite a big step change. Fundamentally this is about balancing resources for a Kafka broker. This can be done by a tool, rather than by changing Kafka. E.g., the tool would take a bunch of disks together, create a volume over them and export that to a Kafka broker (in addition to setting the memory limits for that broker or limiting other resources). A different bunch of disks can then make up a second volume, and be used by another Kafka broker. This is aligned with what Colin is saying (as I understand it).
> > > Disks are not the only resource on a machine; there are several instances where multiple NICs are used, for example. Do we want fine-grained management of all these resources? I'd argue that opens the system up to a lot of complexity.
> > > Thanks
> > > Eno
> > > On 1 Feb 2017, at 01:53, Dong Lin <lindon...@gmail.com> wrote:
> > > > Hi all,
> > > > I am going to initiate the vote if there is no further concern with the KIP.
> > > > Thanks,
> > > > Dong
> > > > On Fri, Jan 27, 2017 at 8:08 PM, radai <radai.rosenbl...@gmail.com> wrote:
> > > >> a few extra points:
> > > >> 1. broker per disk might also incur more client <--> broker sockets: suppose every producer / consumer "talks" to >1 partition, there's a very good chance that partitions that were co-located on a single 10-disk broker would now be split between several single-disk broker processes on the same machine. hard to put a multiplier on this, but likely >x1. sockets are a limited resource at the OS level and incur some memory cost (kernel buffers).
> > > >> 2. there's a memory overhead to spinning up a JVM (compiled code and byte code objects etc). if we assume this overhead is ~300 MB (order of magnitude, specifics vary) then spinning up 10 JVMs would lose you 3 GB of RAM. not a ton, but non-negligible.
> > > >> 3. there would also be some overhead downstream of kafka in any management / monitoring / log aggregation system. likely less than x10 though.
> > > >> 4. (related to the above) - added complexity of administration with more running instances.
> > > >> is anyone running kafka with anywhere near 100GB heaps? i thought the point was to rely on kernel page cache to do the disk buffering ....
> > > >> On Thu, Jan 26, 2017 at 11:00 AM, Dong Lin <lindon...@gmail.com> wrote:
> > > >>> Hey Colin,
> > > >>> Thanks much for the comment. Please see my comments inline.
> > > >>> On Thu, Jan 26, 2017 at 10:15 AM, Colin McCabe <cmcc...@apache.org> wrote:
> > > >>>> On Wed, Jan 25, 2017, at 13:50, Dong Lin wrote:
> > > >>>>> Hey Colin,
> > > >>>>> Good point! Yeah, we have actually considered and tested this solution, which we call one-broker-per-disk. It would work and should require no major change in Kafka as compared to this JBOD KIP. So it would be a good short-term solution.
> > > >>>>> But it has a few drawbacks which make it less desirable in the long term. Assume we have 10 disks on a machine. Here are the problems:
> > > >>>> Hi Dong,
> > > >>>> Thanks for the thoughtful reply.
> > > >>>>> 1) Our stress test results show that one-broker-per-disk has 15% lower throughput.
> > > >>>>> 2) The controller would need to send 10X as many LeaderAndIsrRequest, MetadataUpdateRequest and StopReplicaRequest. This increases the burden on the controller, which can be the performance bottleneck.
> > > >>>> Maybe I'm misunderstanding something, but there would not be 10x as many StopReplicaRequest RPCs, would there? The other requests would increase 10x, but from a pretty low base, right? We are not reassigning partitions all the time, I hope (or else we have bigger problems...)
> > > >>> I think the controller will group StopReplicaRequests per broker and send only one StopReplicaRequest to a broker during controlled shutdown. Anyway, we don't have to worry about this if we agree that the other requests will increase by 10X. The controller sends one MetadataRequest to each broker in the cluster every time there is a leadership change. I am not sure this is a real problem, but in theory this makes the overhead complexity O(number of brokers) and may be a concern in the future. Ideally we should avoid it.
> > > >>>>> 3) Less efficient use of physical resources on the machine. The number of sockets on each machine will increase by 10X. The number of connections between any two machines will increase by 100X.
> > > >>>>> 4) Less efficient management of memory and quotas.
> > > >>>>> 5) Rebalancing between disks/brokers on the same machine will be less efficient and less flexible. A broker has to read data from another broker on the same machine via a socket. It is also harder to do automatic load balancing between disks on the same machine in the future.
> > > >>>>> I will put this and the explanation in the rejected alternatives section.
> > > >>>>> I have a few questions:
> > > >>>>> - Can you explain why this solution can help avoid scalability bottlenecks? I actually think it will exacerbate the scalability problem due to 2) above.
> > > >>>>> - Why can we push more RPCs with this solution?
> > > >>>> To really answer this question we'd have to take a deep dive into the locking of the broker and figure out how effectively it can parallelize truly independent requests. Almost every multithreaded process is going to have shared state, like shared queues or shared sockets, that is going to make scaling less than linear when you add disks or processors. (And clearly, another option is to improve that scalability, rather than going multi-process!)
> > > >>> Yeah, I also think it is better to improve scalability inside the Kafka code if possible. I am not sure we currently have any scalability issue inside Kafka that cannot be addressed without going multi-process.
> > > >>>>> - It is true that a garbage collection in one broker would not affect the others. But that is only after every broker uses just 1/10 of the memory. Can we be sure that this will actually help performance?
> > > >>>> The big question is, how much memory do Kafka brokers use now, and how much will they use in the future? Our experience in HDFS was that once you start getting more than 100-200GB Java heap sizes, full GCs start taking minutes to finish when using the standard JVMs. That alone is a good reason to go multi-process or consider storing more things off the Java heap.
> > > >>> I see. Now I agree one-broker-per-disk should be more efficient in terms of GC, since each broker probably needs less than 1/10 of the memory available on a typical machine nowadays. I will remove this from the reasons for rejection.
> > > >>>> Disk failure is the "easy" case. The "hard" case, which is unfortunately also the much more common case, is disk misbehavior. Towards the end of their lives, disks tend to start slowing down unpredictably. Requests that would have completed immediately before start taking 20, 100, 500 milliseconds. Some files may be readable and other files may not be. System calls hang, sometimes forever, and the Java process can't abort them, because the hang is in the kernel. It is not fun when threads are stuck in "D state" http://stackoverflow.com/questions/20423521/process-perminantly-stuck-on-d-state . Even kill -9 cannot abort the thread then. Fortunately, this is rare.
> > > >>> I agree it is a harder problem, and it is rare. We probably don't have to worry about it in this KIP since this issue is orthogonal to whether or not we use JBOD.
> > > >>>> Another approach we should consider is for Kafka to implement its own storage layer that would stripe across multiple disks. This wouldn't have to be done at the block level, but could be done at the file level. We could use consistent hashing to determine which disks a file should end up on, for example.
> > > >>> Are you suggesting that we should distribute logs, or log segments, across the disks of brokers? I am not sure I fully understand this approach. My gut feeling is that this would be a drastic solution that would require non-trivial design. While this may be useful to Kafka, I would prefer not to discuss this in detail in this thread unless you believe it is strictly superior to the design in this KIP in terms of solving our use-case.
> > > >>>> best,
> > > >>>> Colin
> > > >>>>> Thanks,
> > > >>>>> Dong
> > > >>>>> On Wed, Jan 25, 2017 at 11:34 AM, Colin McCabe <cmcc...@apache.org> wrote:
> > > >>>>>> Hi Dong,
> > > >>>>>> Thanks for the writeup! It's very interesting.
> > > >>>>>> I apologize in advance if this has been discussed somewhere else. But I am curious if you have considered the solution of running multiple brokers per node. Clearly there is a memory overhead with this solution because of the fixed cost of starting multiple JVMs. However, running multiple JVMs would help avoid scalability bottlenecks. You could probably push more RPCs per second, for example. A garbage collection in one broker would not affect the others. It would be interesting to see this considered in the "alternate designs" section, even if you end up deciding it's not the way to go.
> > > >>>>>> best,
> > > >>>>>> Colin
> > > >>>>>> On Thu, Jan 12, 2017, at 10:46, Dong Lin wrote:
> > > >>>>>>> Hi all,
> > > >>>>>>> We created KIP-112: Handle disk failure for JBOD. Please find the KIP wiki at https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD.
> > > >>>>>>> This KIP is related to KIP-113 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories>: Support replicas movement between log directories. Both are needed in order to support JBOD in Kafka. Please help review the KIP. Your feedback is appreciated!
> > > >>>>>>> Thanks,
> > > >>>>>>> Dong
>
> --
> Grant Henke
> Software Engineer | Cloudera
> gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
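Below is a back-of-the-envelope sketch of the arithmetic behind the JBOD vs. RAID-10 and JBOD vs. RAID-0 claims made earlier in the thread. It assumes 10 disks per machine and small, independent per-disk failures; the value of p is purely illustrative, not a measurement from any deployment.

    # Rough arithmetic behind the trade-offs discussed above.
    # Assumptions (illustrative only): 10 disks per machine, independent
    # disk failures, per-disk failure probability p over some time window.

    DISKS_PER_MACHINE = 10
    p = 0.01  # assumed per-disk failure probability

    # Physical copies per logical byte:
    # RAID-10 mirrors each byte within a broker (x2); replication-factor=2
    # adds a copy on a second broker (x2) -> 4 copies in total.
    raid10_copies = 2 * 2
    # JBOD stores one copy per broker; replication-factor=3 -> 3 copies.
    jbod_copies = 3
    print(f"disk usage reduction: {1 - jbod_copies / raid10_copies:.0%}")  # 25%

    # Broker failures tolerated before a partition becomes unavailable:
    print(f"RAID-10 + RF=2 tolerates {2 - 1} broker failure")
    print(f"JBOD    + RF=3 tolerates {3 - 1} broker failures")

    # Unavailability due to disk failure, with replication-factor=2:
    # RAID-0: one bad disk takes down the whole broker, so a broker is lost
    # whenever any of its 10 disks fails (~10p for small p); a message is
    # unavailable when both brokers hosting its replicas are lost.
    p_unavailable_raid0 = (DISKS_PER_MACHINE * p) ** 2
    # JBOD: only the two specific disks holding the replicas matter.
    p_unavailable_jbod = p ** 2
    print(f"RAID-0 vs. JBOD unavailability: "
          f"{p_unavailable_raid0 / p_unavailable_jbod:.0f}x higher")  # 100x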