Hey Grant,

Yes, this KIP does exactly what you described :)
Thanks,
Dong

On Wed, Feb 1, 2017 at 9:45 AM, Grant Henke <ghe...@cloudera.com> wrote:
> Hi Dong,
> Thanks for putting this together.
> Since we are discussing alternative/simplified options, have you considered handling the disk failures broker-side to prevent a crash, marking the disk as "bad" to that individual broker, and continuing as normal? I imagine the broker would then fall out of sync for the replicas hosted on the bad disk and the ISR would shrink. This would allow people using min.isr to keep their data safe, and the cluster operators would see a shrink in many ISRs and hopefully an obvious log message leading to a quick fix. I haven't thought through this idea in depth though, so there could be some shortfalls.
> Thanks,
> Grant
> On Wed, Feb 1, 2017 at 11:21 AM, Dong Lin <lindon...@gmail.com> wrote:
> > Hey Eno,
> > Thanks much for the review.
> > I think your suggestion is to split the disks of a machine into multiple disk sets and run one broker per disk set. Yeah, this is similar to Colin's suggestion of one-broker-per-disk, which we have evaluated at LinkedIn and consider a good short-term approach.
> > As of now I don't think any of these approaches is a better alternative in the long term. I will summarize them here. I have put these reasons in the KIP's motivation section and rejected alternatives section. I am happy to discuss more, and I would certainly like to use an alternative solution that is easier to implement and has better performance.
> > - JBOD vs. RAID-10: if we switch from RAID-10 with replication-factor=2 to JBOD with replication-factor=3, we get a 25% reduction in disk usage and double the number of broker failures tolerated before data unavailability, from 1 to 2. This is a pretty huge gain for any company that uses Kafka at large scale.
> > - JBOD vs. one-broker-per-disk: The benefit of one-broker-per-disk is that no major code change is needed in Kafka. Among the disadvantages of one-broker-per-disk summarized in the KIP and the previous email with Colin, the biggest ones are the 15% throughput loss compared to JBOD and the reduced flexibility to balance load across disks. Further, it probably requires changes to internal deployment tools at various companies to deal with the one-broker-per-disk setup.
> > - JBOD vs. RAID-0: This is the setup used at Microsoft. The problem is that a broker becomes unavailable if any disk fails. Suppose replication-factor=2 and there are 10 disks per machine. Then the probability that any given message becomes unavailable due to disk failure is 100X higher with RAID-0 than with JBOD (see the sketch at the end of this thread).
> > - JBOD vs. one-broker-per-few-disks: one-broker-per-few-disks is somewhere between one-broker-per-disk and RAID-0, so it carries an average of the disadvantages of those two approaches.
> > To answer your question: I think it is reasonable to manage disks in Kafka. By "managing disks" we mean managing the assignment of replicas across disks. Here are my reasons in more detail:
> > - I don't think this KIP is a big step change. By allowing users to configure Kafka to run with multiple log directories or disks, as it does today, it is already implicit that Kafka manages disks. It is just not a complete feature. Microsoft and probably other companies are using this feature despite the undesirable effect that a broker will fail if any disk fails. It is good to complete this feature.
> > - I think it is reasonable to manage disks in Kafka. One of the most important pieces of work that Kafka does is to determine the replica assignment across brokers and make sure enough copies of a given replica are available. I would argue that determining the replica assignment across disks is conceptually not much different.
> > - I would agree that this KIP improves the performance of Kafka at the cost of more complexity inside Kafka, by switching from RAID-10 to JBOD. I would argue that this is the right direction. If we could gain 20%+ performance by managing NICs in Kafka as compared to the existing approach and other alternatives, I would say we should just do it. Such a gain in performance, or equivalently reduction in cost, can save millions of dollars per year for any company running Kafka at large scale.
> > Thanks,
> > Dong
> > On Wed, Feb 1, 2017 at 5:41 AM, Eno Thereska <eno.there...@gmail.com> wrote:
> > > I'm coming somewhat late to the discussion, apologies for that.
> > > I'm worried about this proposal. It's moving Kafka to a world where it manages disks. So in a sense, the scope of the KIP is limited, but the direction it sets for Kafka is quite a big step change. Fundamentally this is about balancing resources for a Kafka broker. This can be done by a tool, rather than by changing Kafka. E.g., the tool would take a bunch of disks together, create a volume over them and export that to a Kafka broker (in addition to setting the memory limits for that broker or limiting other resources). A different bunch of disks can then make up a second volume, and be used by another Kafka broker. This is aligned with what Colin is saying (as I understand it).
> > > Disks are not the only resource on a machine; there are several instances where multiple NICs are used, for example. Do we want fine-grained management of all these resources? I'd argue that opens the system up to a lot of complexity.
> > > Thanks
> > > Eno
> > > On 1 Feb 2017, at 01:53, Dong Lin <lindon...@gmail.com> wrote:
> > > > Hi all,
> > > > I am going to initiate the vote if there is no further concern with the KIP.
> > > > Thanks,
> > > > Dong
> > > > On Fri, Jan 27, 2017 at 8:08 PM, radai <radai.rosenbl...@gmail.com> wrote:
> > > >> a few extra points:
> > > >> 1. broker per disk might also incur more client <--> broker sockets: suppose every producer / consumer "talks" to >1 partition, there's a very good chance that partitions that were co-located on a single 10-disk broker would now be split between several single-disk broker processes on the same machine. hard to put a multiplier on this, but likely >x1. sockets are a limited resource at the OS level and incur some memory cost (kernel buffers).
> > > >> 2. there's a memory overhead to spinning up a JVM (compiled code and byte code objects etc). if we assume this overhead is ~300 MB (order of magnitude, specifics vary) then spinning up 10 JVMs would lose you 3 GB of RAM. not a ton, but non-negligible.
> > > >> 3. there would also be some overhead downstream of kafka in any management / monitoring / log aggregation system. likely less than x10 though.
> > > >> 4. (related to the above) - added complexity of administration with more running instances.
> > > >> is anyone running kafka with anywhere near 100GB heaps? i thought the point was to rely on kernel page cache to do the disk buffering ....
> > > >> On Thu, Jan 26, 2017 at 11:00 AM, Dong Lin <lindon...@gmail.com> wrote:
> > > >>> Hey Colin,
> > > >>> Thanks much for the comment. Please see my comments inline.
> > > >>> On Thu, Jan 26, 2017 at 10:15 AM, Colin McCabe <cmcc...@apache.org> wrote:
> > > >>>> On Wed, Jan 25, 2017, at 13:50, Dong Lin wrote:
> > > >>>>> Hey Colin,
> > > >>>>> Good point! Yeah, we have actually considered and tested this solution, which we call one-broker-per-disk. It would work and should require no major change in Kafka as compared to this JBOD KIP. So it would be a good short-term solution.
> > > >>>>> But it has a few drawbacks which make it less desirable in the long term. Assume we have 10 disks on a machine. Here are the problems:
> > > >>>> Hi Dong,
> > > >>>> Thanks for the thoughtful reply.
> > > >>>>> 1) Our stress test results show that one-broker-per-disk has 15% lower throughput.
> > > >>>>> 2) The controller would need to send 10X as many LeaderAndIsrRequest, MetadataUpdateRequest and StopReplicaRequest. This increases the burden on the controller, which can be the performance bottleneck.
> > > >>>> Maybe I'm misunderstanding something, but there would not be 10x as many StopReplicaRequest RPCs, would there? The other requests would increase 10x, but from a pretty low base, right? We are not reassigning partitions all the time, I hope (or else we have bigger problems...)
> > > >>> I think the controller will group StopReplicaRequests per broker and send only one StopReplicaRequest to a broker during controlled shutdown. Anyway, we don't have to worry about this if we agree that the other requests will increase by 10X. The controller sends one MetadataRequest to each broker in the cluster every time there is a leadership change. I am not sure this is a real problem, but in theory this makes the overhead complexity O(number of brokers) and may be a concern in the future. Ideally we should avoid it.
> > > >>>>> 3) Less efficient use of physical resources on the machine. The number of sockets on each machine will increase by 10X. The number of connections between any two machines will increase by 100X.
> > > >>>>> 4) Less efficient management of memory and quotas.
> > > >>>>> 5) Rebalancing between disks/brokers on the same machine will be less efficient and less flexible. A broker has to read data from another broker on the same machine via a socket. It is also harder to do automatic load balancing between disks on the same machine in the future.
> > > >>>>> I will put this and the explanation in the rejected alternatives section.
> > > >>>>> I have a few questions:
> > > >>>>> - Can you explain why this solution can help avoid scalability bottlenecks? I actually think it will exacerbate the scalability problem due to 2) above.
> > > >>>>> - Why can we push more RPCs with this solution?
> > > >>>> To really answer this question we'd have to take a deep dive into the locking of the broker and figure out how effectively it can parallelize truly independent requests. Almost every multithreaded process is going to have shared state, like shared queues or shared sockets, that is going to make scaling less than linear when you add disks or processors. (And clearly, another option is to improve that scalability, rather than going multi-process!)
> > > >>> Yeah, I also think it is better to improve scalability inside the Kafka code if possible. I am not sure we currently have any scalability issue inside Kafka that cannot be addressed without going multi-process.
> > > >>>>> - It is true that a garbage collection in one broker would not affect the others. But that is only after every broker uses just 1/10 of the memory. Can we be sure that this will actually help performance?
> > > >>>> The big question is, how much memory do Kafka brokers use now, and how much will they use in the future? Our experience in HDFS was that once you start getting more than 100-200GB Java heap sizes, full GCs start taking minutes to finish when using the standard JVMs. That alone is a good reason to go multi-process or consider storing more things off the Java heap.
> > > >>> I see. Now I agree one-broker-per-disk should be more efficient in terms of GC, since each broker probably needs less than 1/10 of the memory available on a typical machine nowadays. I will remove this from the reasons for rejection.
> > > >>>> Disk failure is the "easy" case. The "hard" case, which is unfortunately also the much more common case, is disk misbehavior. Towards the end of their lives, disks tend to start slowing down unpredictably. Requests that would have completed immediately before start taking 20, 100, 500 milliseconds. Some files may be readable and other files may not be. System calls hang, sometimes forever, and the Java process can't abort them, because the hang is in the kernel. It is not fun when threads are stuck in "D state" http://stackoverflow.com/questions/20423521/process-perminantly-stuck-on-d-state . Even kill -9 cannot abort the thread then. Fortunately, this is rare.
> > > >>> I agree it is a harder problem, and it is rare. We probably don't have to worry about it in this KIP since this issue is orthogonal to whether or not we use JBOD.
> > > >>>> Another approach we should consider is for Kafka to implement its own storage layer that would stripe across multiple disks. This wouldn't have to be done at the block level, but could be done at the file level. We could use consistent hashing to determine which disks a file should end up on, for example.
> > > >>> Are you suggesting that we should distribute logs, or log segments, across the disks of brokers? I am not sure I fully understand this approach. My gut feeling is that this would be a drastic solution that would require non-trivial design. While this may be useful to Kafka, I would prefer not to discuss this in detail in this thread unless you believe it is strictly superior to the design in this KIP in terms of solving our use-case.
> > > >>>> best,
> > > >>>> Colin
> > > >>>>> Thanks,
> > > >>>>> Dong
> > > >>>>> On Wed, Jan 25, 2017 at 11:34 AM, Colin McCabe <cmcc...@apache.org> wrote:
> > > >>>>>> Hi Dong,
> > > >>>>>> Thanks for the writeup! It's very interesting.
> > > >>>>>> I apologize in advance if this has been discussed somewhere else. But I am curious if you have considered the solution of running multiple brokers per node. Clearly there is a memory overhead with this solution because of the fixed cost of starting multiple JVMs. However, running multiple JVMs would help avoid scalability bottlenecks. You could probably push more RPCs per second, for example. A garbage collection in one broker would not affect the others. It would be interesting to see this considered in the "alternate designs" section, even if you end up deciding it's not the way to go.
> > > >>>>>> best,
> > > >>>>>> Colin
> > > >>>>>> On Thu, Jan 12, 2017, at 10:46, Dong Lin wrote:
> > > >>>>>>> Hi all,
> > > >>>>>>> We created KIP-112: Handle disk failure for JBOD. Please find the KIP wiki at https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD.
> > > >>>>>>> This KIP is related to KIP-113 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories>: Support replicas movement between log directories. Both are needed in order to support JBOD in Kafka. Please help review the KIP. Your feedback is appreciated!
> > > >>>>>>> Thanks,
> > > >>>>>>> Dong
>
> --
> Grant Henke
> Software Engineer | Cloudera
> gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
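Below is a back-of-the-envelope sketch of the arithmetic behind the JBOD vs. RAID-10 and JBOD vs. RAID-0 claims made earlier in the thread. It assumes 10 disks per machine and small, independent per-disk failures; the value of p is purely illustrative, not a measurement from any deployment.

    # Rough arithmetic behind the trade-offs discussed above.
    # Assumptions (illustrative only): 10 disks per machine, independent
    # disk failures, per-disk failure probability p over some time window.

    DISKS_PER_MACHINE = 10
    p = 0.01  # assumed per-disk failure probability

    # Physical copies per logical byte:
    # RAID-10 mirrors each byte within a broker (x2); replication-factor=2
    # adds a copy on a second broker (x2) -> 4 copies in total.
    raid10_copies = 2 * 2
    # JBOD stores one copy per broker; replication-factor=3 -> 3 copies.
    jbod_copies = 3
    print(f"disk usage reduction: {1 - jbod_copies / raid10_copies:.0%}")  # 25%

    # Broker failures tolerated before a partition becomes unavailable:
    print(f"RAID-10 + RF=2 tolerates {2 - 1} broker failure")
    print(f"JBOD    + RF=3 tolerates {3 - 1} broker failures")

    # Unavailability due to disk failure, with replication-factor=2:
    # RAID-0: one bad disk takes down the whole broker, so a broker is lost
    # whenever any of its 10 disks fails (~10p for small p); a message is
    # unavailable when both brokers hosting its replicas are lost.
    p_unavailable_raid0 = (DISKS_PER_MACHINE * p) ** 2
    # JBOD: only the two specific disks holding the replicas matter.
    p_unavailable_jbod = p ** 2
    print(f"RAID-0 vs. JBOD unavailability: "
          f"{p_unavailable_raid0 / p_unavailable_jbod:.0f}x higher")  # 100x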