On Thu, Feb 9, 2017, at 11:40, Dong Lin wrote:
> Thanks for all the comments Colin!
> 
> To answer your questions:
> - Yes, a broker will shut down if all of its log directories are bad.

That makes sense.  Can you add this to the writeup?

> - I updated the KIP to explicitly state that a log directory will be
> assumed to be good until the broker sees an IOException when it tries to
> access the log directory.

Thanks.

> - The controller doesn't explicitly know whether there is a new log directory
> or not. All the controller knows is whether replicas are online or offline,
> based on the LeaderAndIsrResponse. According to the existing Kafka
> implementation, the controller will always send a LeaderAndIsrRequest to a
> broker after it bounces.

I thought so.  It's good to clarify, though.  Do you think it's worth
adding a quick discussion of this on the wiki?

best,
Colin

> 
> Please see this
> <https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=67638402&selectedPageVersions=9&selectedPageVersions=10>
> for the change of the KIP.
> 
> On Thu, Feb 9, 2017 at 11:04 AM, Colin McCabe <cmcc...@apache.org> wrote:
> 
> > On Thu, Feb 9, 2017, at 11:03, Colin McCabe wrote:
> > > Thanks, Dong L.
> > >
> > > Do we plan on bringing down the broker process when all log directories
> > > are offline?
> > >
> > > Can you explicitly state on the KIP that the log dirs are all considered
> > > good after the broker process is bounced?  It seems like an important
> > > thing to be clear about.  Also, perhaps discuss how the controller
> > > becomes aware of the newly good log directories after a broker bounce
> > > (and whether this triggers re-election).
> >
> > I meant to write, all the log dirs where the broker can still read the
> > index and some other files.  Clearly, log dirs that are completely
> > inaccessible will still be considered bad after a broker process bounce.
> >
> > best,
> > Colin
> >
> > >
> > > +1 (non-binding) aside from that
> > >
> > >
> > >
> > > On Wed, Feb 8, 2017, at 00:47, Dong Lin wrote:
> > > > Hi all,
> > > >
> > > > Thank you all for the helpful suggestion. I have updated the KIP to
> > > > address
> > > > the comments received so far. See here
> > > > <https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?
> > pageId=67638402&selectedPageVersions=8&selectedPageVersions=9>to
> > > > read the changes of the KIP. Here is a summary of change:
> > > >
> > > > - Updated the Proposed Change section to change the recovery steps. After
> > > > this change, the broker will also create replicas as long as all log
> > > > directories are working.
> > > > - Removed kafka-log-dirs.sh from this KIP since users no longer need to
> > > > use it for recovery from bad disks.
> > > > - Explained how the znode controller_managed_state is managed in the
> > > > Public
> > > > interface section.
> > > > - Explained what happens during controller failover, partition
> > > > reassignment
> > > > and topic deletion in the Proposed Change section.
> > > > - Updated Future Work section to include the following potential
> > > > improvements
> > > >   - Let the broker notify the controller of ISR changes and disk state
> > > > changes via RPC instead of using zookeeper
> > > >   - Handle various failure scenarios (e.g. a slow disk) on a case-by-case
> > > > basis. For example, we may want to detect a slow disk and consider it
> > > > offline.
> > > >   - Allow admin to mark a directory as bad so that it will not be used.
> > > >
> > > > Thanks,
> > > > Dong
> > > >
> > > >
> > > >
> > > > On Tue, Feb 7, 2017 at 5:23 PM, Dong Lin <lindon...@gmail.com> wrote:
> > > >
> > > > > Hey Eno,
> > > > >
> > > > > Thanks much for the comment!
> > > > >
> > > > > I still think the complexity added to Kafka is justified by its
> > benefit.
> > > > > Let me provide my reasons below.
> > > > >
> > > > > 1) The additional logic is easy to understand and thus its complexity
> > > > > should be reasonable.
> > > > >
> > > > > On the broker side, it needs to catch the exception when accessing a log
> > > > > directory, mark the log directory and all its replicas as offline, notify
> > > > > the controller by writing the zookeeper notification path, and specify the
> > > > > error in the LeaderAndIsrResponse. On the controller side, it will listen to
> > > > > zookeeper for disk failure notifications, learn about offline replicas from
> > > > > the LeaderAndIsrResponse, and take offline replicas into consideration when
> > > > > electing leaders. It also marks replicas as created in zookeeper and uses
> > > > > that to determine whether a replica has been created.
> > > > >
> > > > > That is all the logic we need to add in Kafka. I personally feel
> > this is
> > > > > easy to reason about.
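> > > > >
> > > > > To make this concrete, here is a very rough Java sketch of what the
> > > > > broker-side handling could look like (all names below are made up for
> > > > > illustration; this is not the actual implementation or API):
> > > > >
> > > > >   import java.io.IOException;
> > > > >   import java.util.HashSet;
> > > > >   import java.util.Set;
> > > > >
> > > > >   // Illustrative sketch only -- hypothetical names, not real KIP-112 code.
> > > > >   class LogDirFailureHandler {
> > > > >       interface LogDirectory {
> > > > >           void append(String topicPartition, byte[] records) throws IOException;
> > > > >           Set<String> replicas();
> > > > >           void markOffline();
> > > > >       }
> > > > >
> > > > >       private final Set<String> offlineReplicas = new HashSet<>();
> > > > >
> > > > >       void append(LogDirectory dir, String topicPartition, byte[] records) {
> > > > >           try {
> > > > >               dir.append(topicPartition, records);
> > > > >           } catch (IOException e) {
> > > > >               // 1. Mark the log directory and every replica on it as offline.
> > > > >               dir.markOffline();
> > > > >               offlineReplicas.addAll(dir.replicas());
> > > > >               // 2. Notify the controller by writing the zookeeper notification
> > > > >               //    path so it can elect new leaders for the offline replicas.
> > > > >               notifyControllerViaZookeeper();
> > > > >               // 3. The affected replicas are then reported with an error code
> > > > >               //    in the next LeaderAndIsrResponse.
> > > > >           }
> > > > >       }
> > > > >
> > > > >       private void notifyControllerViaZookeeper() { /* write znode (omitted) */ }
> > > > >   }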
> > > > >
> > > > > 2) The additional code is not much.
> > > > >
> > > > > I expect the code for KIP-112 to be around 1,100 lines of new code.
> > > > > Previously I implemented a prototype of a slightly different design (see here
> > > > > <https://docs.google.com/document/d/1Izza0SBmZMVUBUt9s_-Dqi3D8e0KGJQYW8xgEdRsgAI/edit>)
> > > > > and uploaded it to github (see here
> > > > > <https://github.com/lindong28/kafka/tree/JBOD>). The patch changed 33
> > > > > files, added 1185 lines and deleted 183 lines. The prototype patch is
> > > > > actually smaller than the patch for KIP-107 (see here
> > > > > <https://github.com/apache/kafka/pull/2476>), which has already been
> > > > > accepted. The KIP-107 patch changed 49 files, added 1349 lines and deleted
> > > > > 141 lines.
> > > > >
> > > > > 3) Comparison with one-broker-per-multiple-volumes
> > > > >
> > > > > This KIP can improve the availability of Kafka in this case such
> > that one
> > > > > failed volume doesn't bring down the entire broker.
> > > > >
> > > > > 4) Comparison with one-broker-per-volume
> > > > >
> > > > > If each volume maps to multiple disks, then we still have a similar
> > > > > problem: the broker will fail if any disk of the volume fails.
> > > > >
> > > > > If each volume maps to one disk, it means that we need to deploy 10
> > > > > brokers on a machine if the machine has 10 disks. I will explain the
> > > > > concerns with this approach in order of importance.
> > > > >
> > > > > - It would be awkward to tell Kafka users to deploy 50 brokers on a
> > > > > machine with 50 disks.
> > > > >
> > > > > - Whether users deploy Kafka on a commercial cloud platform or in their
> > > > > own cluster, the size of the largest disk is usually limited. There will
> > > > > be scenarios where users want to increase broker capacity by having
> > > > > multiple disks per broker. This JBOD KIP makes that feasible without
> > > > > hurting availability due to a single disk failure.
> > > > >
> > > > > - Automatic load rebalance across disks will be easier and more
> > flexible
> > > > > if one broker has multiple disks. This can be future work.
> > > > >
> > > > > - There is a performance concern when you deploy 10 brokers vs. 1 broker on
> > > > > one machine. The metadata and requests in the cluster, including
> > > > > FetchRequest, ProduceResponse, MetadataRequest and so on, will all be 10X
> > > > > more. The packets-per-second will be 10X higher, which may limit
> > > > > performance if pps is the performance bottleneck. The number of sockets on
> > > > > the machine is 10X higher. And the number of replication threads will be
> > > > > 100X more. The impact will be more significant as the number of disks per
> > > > > machine increases. Thus it will limit Kafka's scalability in the long term.
> > > > >
> > > > > Thanks,
> > > > > Dong
> > > > >
> > > > >
> > > > > On Tue, Feb 7, 2017 at 1:51 AM, Eno Thereska <eno.there...@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > >> Hi Dong,
> > > > >>
> > > > >> To simplify the discussion today, on my part I'll zoom into one
> > thing
> > > > >> only:
> > > > >>
> > > > >> - I'll discuss the options called below : "one-broker-per-disk" or
> > > > >> "one-broker-per-few-disks".
> > > > >>
> > > > >> - I completely buy the JBOD vs RAID arguments so there is no need to
> > > > >> discuss that part for me. I buy it that JBODs are good.
> > > > >>
> > > > >> I find the terminology can be improved a bit. Ideally we'd be
> > talking
> > > > >> about volumes, not disks. Just to make it clear that Kafka understands
> > > > >> volumes/directories, not individual raw disks. So by
> > > > >> "one-broker-per-few-disks" what I mean is that the admin can pool a
> > few
> > > > >> disks together to create a volume/directory and give that to Kafka.
> > > > >>
> > > > >>
> > > > >> The kernel of my question will be that the admin already has tools
> > to 1)
> > > > >> create volumes/directories from a JBOD and 2) start a broker on a
> > desired
> > > > >> machine and 3) assign a broker resources like a directory. I claim
> > that
> > > > >> those tools are sufficient to optimise resource allocation.  I
> > understand
> > > > >> that a broker could manage point 3) itself, ie juggle the
> > directories. My
> > > > >> question is whether the complexity added to Kafka is justified.
> > > > >> Operationally it seems to me an admin will still have to do all the
> > three
> > > > >> items above.
> > > > >>
> > > > >> Looking forward to the discussion
> > > > >> Thanks
> > > > >> Eno
> > > > >>
> > > > >>
> > > > >> > On 1 Feb 2017, at 17:21, Dong Lin <lindon...@gmail.com> wrote:
> > > > >> >
> > > > >> > Hey Eno,
> > > > >> >
> > > > >> > Thanks much for the review.
> > > > >> >
> > > > >> > I think your suggestion is to split disks of a machine into
> > multiple
> > > > >> disk
> > > > >> > sets and run one broker per disk set. Yeah this is similar to
> > Colin's
> > > > >> > suggestion of one-broker-per-disk, which we have evaluated at LinkedIn
> > > > >> > and consider to be a good short-term approach.
> > > > >> >
> > > > >> > As of now I don't think any of these approaches is a better alternative in
> > > > >> > the long term. I will summarize them here. I have put these reasons in the
> > > > >> > KIP's motivation section and rejected alternatives section. I am happy to
> > > > >> > discuss more, and I would certainly like to use an alternative solution
> > > > >> > that is easier to implement and has better performance.
> > > > >> >
> > > > >> > - JBOD vs. RAID-10: if we switch from RAID-10 with replication-factor=2 to
> > > > >> > JBOD with replication-factor=3, we get a 25% reduction in disk usage (each
> > > > >> > byte is stored 4 times with mirrored RAID-10 and replication-factor=2, vs.
> > > > >> > 3 times with JBOD and replication-factor=3) and double the tolerance of
> > > > >> > broker failures before data unavailability, from 1 to 2. This is a pretty
> > > > >> > huge gain for any company that uses Kafka at large scale.
> > > > >> >
> > > > >> > - JBOD vs. one-broker-per-disk: The benefit of one-broker-per-disk is that
> > > > >> > no major code change is needed in Kafka. Among the disadvantages of
> > > > >> > one-broker-per-disk summarized in the KIP and the previous email with
> > > > >> > Colin, the biggest ones are the 15% throughput loss compared to JBOD and
> > > > >> > the reduced flexibility to balance across disks. Further, it probably
> > > > >> > requires changes to internal deployment tools at various companies to deal
> > > > >> > with a one-broker-per-disk setup.
> > > > >> >
> > > > >> > - JBOD vs. RAID-0: This is the setup that is used at Microsoft. The problem
> > > > >> > is that a broker becomes unavailable if any disk fails. Suppose
> > > > >> > replication-factor=2 and there are 10 disks per machine. Then the
> > > > >> > probability that any message becomes unavailable due to disk failure with
> > > > >> > RAID-0 is 100X higher than that with JBOD.
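> > > > >> >
> > > > >> > As a rough back-of-the-envelope check of that 100X figure (the per-disk
> > > > >> > failure probability below is just an arbitrary example value, not a
> > > > >> > measurement):
> > > > >> >
> > > > >> >   public class UnavailabilityEstimate {
> > > > >> >       public static void main(String[] args) {
> > > > >> >           double p = 0.001;  // prob. a single disk fails in some window (example)
> > > > >> >           int disks = 10;    // disks per machine, per the example above
> > > > >> >           // JBOD, replication-factor=2: a message is unavailable only if the
> > > > >> >           // two specific disks holding its replicas both fail.
> > > > >> >           double jbod = p * p;
> > > > >> >           // RAID-0, replication-factor=2: one disk failure takes down the whole
> > > > >> >           // broker, so each replica is lost if any of the 10 disks fails.
> > > > >> >           double raid0 = (disks * p) * (disks * p);
> > > > >> >           System.out.println(raid0 / jbod);  // prints ~100
> > > > >> >       }
> > > > >> >   }
> > > > >> >
> > > > >> > The ratio is (10p)^2 / p^2 = 100 regardless of the value of p.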
> > > > >> >
> > > > >> > - JBOD vs. one-broker-per-few-disks: one-broker-per-few-disks is somewhere
> > > > >> > between one-broker-per-disk and RAID-0. So it carries an average of the
> > > > >> > disadvantages of these two approaches.
> > > > >> >
> > > > >> > To answer your question: I think it is reasonable to manage disks
> > > > >> > in Kafka. By "managing disks" we mean managing the assignment of
> > > > >> > replicas across disks. Here are my reasons in more detail:
> > > > >> >
> > > > >> > - I don't think this KIP is a big step change. Since Kafka already allows
> > > > >> > users to configure multiple log directories or disks, it is already
> > > > >> > implicitly managing disks; it is just not a complete feature. Microsoft
> > > > >> > and probably other companies are using this feature with the undesirable
> > > > >> > effect that a broker will fail if any disk fails. It is good to complete
> > > > >> > this feature.
> > > > >> >
> > > > >> > - I think it is reasonable to manage disks in Kafka. One of the most
> > > > >> > important things Kafka does is determine the replica assignment across
> > > > >> > brokers and make sure enough copies of a given partition are available.
> > > > >> > I would argue that determining the replica assignment across disks is
> > > > >> > conceptually not much different.
> > > > >> >
> > > > >> > - I would agree that this KIP improves the performance of Kafka at the
> > > > >> > cost of more complexity inside Kafka, by switching from RAID-10 to JBOD.
> > > > >> > I would argue that this is the right direction. If we could gain 20%+
> > > > >> > performance by managing NICs in Kafka, as compared to the existing
> > > > >> > approach and other alternatives, I would say we should just do it. Such a
> > > > >> > gain in performance, or equivalently a reduction in cost, can save
> > > > >> > millions of dollars per year for any company running Kafka at large
> > > > >> > scale.
> > > > >> >
> > > > >> > Thanks,
> > > > >> > Dong
> > > > >> >
> > > > >> >
> > > > >> > On Wed, Feb 1, 2017 at 5:41 AM, Eno Thereska <
> > eno.there...@gmail.com>
> > > > >> wrote:
> > > > >> >
> > > > >> >> I'm coming somewhat late to the discussion, apologies for that.
> > > > >> >>
> > > > >> >> I'm worried about this proposal. It's moving Kafka to a world
> > where it
> > > > >> >> manages disks. So in a sense, the scope of the KIP is limited,
> > but the
> > > > >> >> direction it sets for Kafka is quite a big step change.
> > Fundamentally
> > > > >> this
> > > > >> >> is about balancing resources for a Kafka broker. This can be
> > done by a
> > > > >> >> tool, rather than by changing Kafka. E.g., the tool would take a
> > bunch
> > > > >> of
> > > > >> >> disks together, create a volume over them and export that to a
> > Kafka
> > > > >> broker
> > > > >> >> (in addition to setting the memory limits for that broker or
> > limiting
> > > > >> other
> > > > >> >> resources). A different bunch of disks can then make up a second
> > > > >> volume,
> > > > >> >> and be used by another Kafka broker. This is aligned with what
> > Colin is
> > > > >> >> saying (as I understand it).
> > > > >> >>
> > > > >> >> Disks are not the only resource on a machine; there are several instances
> > > > >> >> where multiple NICs are used, for example. Do we want fine-grained
> > > > >> >> management of all these resources? I'd argue that opens up the system
> > > > >> >> to a lot of complexity.
> > > > >> >> lot of complexity.
> > > > >> >>
> > > > >> >> Thanks
> > > > >> >> Eno
> > > > >> >>
> > > > >> >>
> > > > >> >>> On 1 Feb 2017, at 01:53, Dong Lin <lindon...@gmail.com> wrote:
> > > > >> >>>
> > > > >> >>> Hi all,
> > > > >> >>>
> > > > >> >>> I am going to initiate the vote if there is no further concern with
> > > > >> >>> the KIP.
> > > > >> >>>
> > > > >> >>> Thanks,
> > > > >> >>> Dong
> > > > >> >>>
> > > > >> >>>
> > > > >> >>> On Fri, Jan 27, 2017 at 8:08 PM, radai <
> > radai.rosenbl...@gmail.com>
> > > > >> >> wrote:
> > > > >> >>>
> > > > >> >>>> a few extra points:
> > > > >> >>>>
> > > > >> >>>> 1. broker per disk might also incur more client <--> broker
> > sockets:
> > > > >> >>>> suppose every producer / consumer "talks" to >1 partition,
> > there's a
> > > > >> >> very
> > > > >> >>>> good chance that partitions that were co-located on a single
> > 10-disk
> > > > >> >> broker
> > > > >> >>>> would now be split between several single-disk broker
> > processes on
> > > > >> the
> > > > >> >> same
> > > > >> >>>> machine. hard to put a multiplier on this, but likely >x1.
> > sockets
> > > > >> are a
> > > > >> >>>> limited resource at the OS level and incur some memory cost
> > (kernel
> > > > >> >>>> buffers)
> > > > >> >>>>
> > > > >> >>>> 2. there's a memory overhead to spinning up a JVM (compiled
> > code and
> > > > >> >> byte
> > > > >> >>>> code objects etc). if we assume this overhead is ~300 MB
> > (order of
> > > > >> >>>> magnitude, specifics vary) than spinning up 10 JVMs would lose
> > you 3
> > > > >> GB
> > > > >> >> of
> > > > >> >>>> RAM. not a ton, but non negligible.
> > > > >> >>>>
> > > > >> >>>> 3. there would also be some overhead downstream of kafka in any
> > > > >> >> management
> > > > >> >>>> / monitoring / log aggregation system. likely less than x10
> > though.
> > > > >> >>>>
> > > > >> >>>> 4. (related to above) - added complexity of administration
> > with more
> > > > >> >>>> running instances.
> > > > >> >>>>
> > > > >> >>>> is anyone running kafka with anywhere near 100GB heaps? i
> > thought the
> > > > >> >> point
> > > > >> >>>> was to rely on kernel page cache to do the disk buffering ....
> > > > >> >>>>
> > > > >> >>>> On Thu, Jan 26, 2017 at 11:00 AM, Dong Lin <
> > lindon...@gmail.com>
> > > > >> wrote:
> > > > >> >>>>
> > > > >> >>>>> Hey Colin,
> > > > >> >>>>>
> > > > >> >>>>> Thanks much for the comment. Please see my comments inline.
> > > > >> >>>>>
> > > > >> >>>>> On Thu, Jan 26, 2017 at 10:15 AM, Colin McCabe <
> > cmcc...@apache.org>
> > > > >> >>>> wrote:
> > > > >> >>>>>
> > > > >> >>>>>> On Wed, Jan 25, 2017, at 13:50, Dong Lin wrote:
> > > > >> >>>>>>> Hey Colin,
> > > > >> >>>>>>>
> > > > >> >>>>>>> Good point! Yeah we have actually considered and tested this
> > > > >> >>>> solution,
> > > > >> >>>>>>> which we call one-broker-per-disk. It would work and should
> > > > >> require
> > > > >> >>>> no
> > > > >> >>>>>>> major change in Kafka as compared to this JBOD KIP. So it
> > would
> > > > >> be a
> > > > >> >>>>> good
> > > > >> >>>>>>> short term solution.
> > > > >> >>>>>>>
> > > > >> >>>>>>> But it has a few drawbacks which make it less desirable in the
> > > > >> >>>>>>> long term.
> > > > >> >>>>>>> Assume we have 10 disks on a machine. Here are the problems:
> > > > >> >>>>>>
> > > > >> >>>>>> Hi Dong,
> > > > >> >>>>>>
> > > > >> >>>>>> Thanks for the thoughtful reply.
> > > > >> >>>>>>
> > > > >> >>>>>>>
> > > > >> >>>>>>> 1) Our stress test result shows that one-broker-per-disk
> > has 15%
> > > > >> >>>> lower
> > > > >> >>>>>>> throughput
> > > > >> >>>>>>>
> > > > >> >>>>>>> 2) The controller would need to send 10X as many LeaderAndIsrRequests,
> > > > >> >>>>>>> MetadataUpdateRequests and StopReplicaRequests. This increases the
> > > > >> >>>>>>> burden on the controller, which can be the performance bottleneck.
> > > > >> >>>>>>
> > > > >> >>>>>> Maybe I'm misunderstanding something, but there would not be
> > 10x as
> > > > >> >>>> many
> > > > >> >>>>>> StopReplicaRequest RPCs, would there?  The other requests
> > would
> > > > >> >>>> increase
> > > > >> >>>>>> 10x, but from a pretty low base, right?  We are not
> > reassigning
> > > > >> >>>>>> partitions all the time, I hope (or else we have bigger
> > > > >> problems...)
> > > > >> >>>>>>
> > > > >> >>>>>
> > > > >> >>>>> I think the controller will group StopReplicaRequests per broker and
> > > > >> >>>>> send only one StopReplicaRequest to a broker during controlled
> > > > >> >>>>> shutdown. Anyway, we don't have to worry about this if we agree that
> > > > >> >>>>> the other requests will increase by 10X. One MetadataRequest is sent
> > > > >> >>>>> to each broker in the cluster every time there is a leadership change.
> > > > >> >>>>> I am not sure this is a real problem, but in theory it makes the
> > > > >> >>>>> overhead complexity O(number of brokers) and it may be a concern in
> > > > >> >>>>> the future. Ideally we should avoid it.
> > > > >> >>>>>
> > > > >> >>>>>
> > > > >> >>>>>>
> > > > >> >>>>>>>
> > > > >> >>>>>>> 3) Less efficient use of physical resources on the machine. The number
> > > > >> >>>>>>> of sockets on each machine will increase by 10X. The number of
> > > > >> >>>>>>> connections between any two machines will increase by 100X.
> > > > >> >>>>>>>
> > > > >> >>>>>>> 4) Less efficient management of memory and quotas.
> > > > >> >>>>>>>
> > > > >> >>>>>>> 5) Rebalancing between disks/brokers on the same machine will be less
> > > > >> >>>>>>> efficient and less flexible. A broker has to read data from another
> > > > >> >>>>>>> broker on the same machine via a socket. It is also harder to do
> > > > >> >>>>>>> automatic load balancing between disks on the same machine in the
> > > > >> >>>>>>> future.
> > > > >> >>>>>>>
> > > > >> >>>>>>> I will put this and the explanation in the rejected alternatives
> > > > >> >>>>>>> section. I have a few questions:
> > > > >> >>>>>>>
> > > > >> >>>>>>> - Can you explain why this solution can help avoid the scalability
> > > > >> >>>>>>> bottleneck? I actually think it will exacerbate the scalability
> > > > >> >>>>>>> problem due to 2) above.
> > > > >> >>>>>>> - Why can we push more RPCs with this solution?
> > > > >> >>>>>>
> > > > >> >>>>>> To really answer this question we'd have to take a deep dive
> > into
> > > > >> the
> > > > >> >>>>>> locking of the broker and figure out how effectively it can
> > > > >> >> parallelize
> > > > >> >>>>>> truly independent requests.  Almost every multithreaded
> > process is
> > > > >> >>>> going
> > > > >> >>>>>> to have shared state, like shared queues or shared sockets,
> > that is
> > > > >> >>>>>> going to make scaling less than linear when you add disks or
> > > > >> >>>> processors.
> > > > >> >>>>>> (And clearly, another option is to improve that scalability,
> > rather
> > > > >> >>>>>> than going multi-process!)
> > > > >> >>>>>>
> > > > >> >>>>>
> > > > >> >>>>> Yeah, I also think it is better to improve scalability inside the Kafka
> > > > >> >>>>> code if possible. I am not sure we currently have any scalability issue
> > > > >> >>>>> inside Kafka that cannot be addressed without going multi-process.
> > > > >> >>>>>
> > > > >> >>>>>
> > > > >> >>>>>>
> > > > >> >>>>>>> - It is true that a garbage collection in one broker would not affect
> > > > >> >>>>>>> the others. But that is only because every broker uses just 1/10 of the
> > > > >> >>>>>>> memory. Can we be sure that this will actually help performance?
> > > > >> >>>>>>
> > > > >> >>>>>> The big question is, how much memory do Kafka brokers use
> > now, and
> > > > >> how
> > > > >> >>>>>> much will they use in the future?  Our experience in HDFS
> > was that
> > > > >> >> once
> > > > >> >>>>>> you start getting more than 100-200GB Java heap sizes, full
> > GCs
> > > > >> start
> > > > >> >>>>>> taking minutes to finish when using the standard JVMs.  That
> > alone
> > > > >> is
> > > > >> >> a
> > > > >> >>>>>> good reason to go multi-process or consider storing more
> > things off
> > > > >> >> the
> > > > >> >>>>>> Java heap.
> > > > >> >>>>>>
> > > > >> >>>>>
> > > > >> >>>>> I see. Now I agree one-broker-per-disk should be more
> > efficient in
> > > > >> >> terms
> > > > >> >>>> of
> > > > >> >>>>> GC since each broker probably needs less than 1/10 of the
> > memory
> > > > >> >>>> available
> > > > >> >>>>> on a typical machine nowadays. I will remove this from the reasons
> > > > >> >>>>> for rejection.
> > > > >> >>>>>
> > > > >> >>>>>
> > > > >> >>>>>>
> > > > >> >>>>>> Disk failure is the "easy" case.  The "hard" case, which is
> > > > >> >>>>>> unfortunately also the much more common case, is disk
> > misbehavior.
> > > > >> >>>>>> Towards the end of their lives, disks tend to start slowing
> > down
> > > > >> >>>>>> unpredictably.  Requests that would have completed
> > immediately
> > > > >> before
> > > > >> >>>>>> start taking 20, 100, 500 milliseconds.  Some files may be
> > readable
> > > > >> and
> > > > >> >>>>>> other files may not be.  System calls hang, sometimes
> > forever, and
> > > > >> the
> > > > >> >>>>>> Java process can't abort them, because the hang is in the
> > kernel.
> > > > >> It
> > > > >> >>>> is
> > > > >> >>>>>> not fun when threads are stuck in "D state"
> > > > >> >>>>>> http://stackoverflow.com/questions/20423521/process-perminan
> > > > >> >>>>>> tly-stuck-on-d-state
> > > > >> >>>>>> .  Even kill -9 cannot abort the thread then.  Fortunately,
> > this is
> > > > >> >>>>>> rare.
> > > > >> >>>>>>
> > > > >> >>>>>
> > > > >> >>>>> I agree it is a harder problem and it is rare. We probably
> > don't
> > > > >> have
> > > > >> >> to
> > > > >> >>>>> worry about it in this KIP since this issue is orthogonal to
> > > > >> whether or
> > > > >> >>>> not
> > > > >> >>>>> we use JBOD.
> > > > >> >>>>>
> > > > >> >>>>>
> > > > >> >>>>>>
> > > > >> >>>>>> Another approach we should consider is for Kafka to
> > implement its
> > > > >> own
> > > > >> >>>>>> storage layer that would stripe across multiple disks.  This
> > > > >> wouldn't
> > > > >> >>>>>> have to be done at the block level, but could be done at the
> > file
> > > > >> >>>> level.
> > > > >> >>>>>> We could use consistent hashing to determine which disks a
> > file
> > > > >> should
> > > > >> >>>>>> end up on, for example.
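> > > > >> >>>>>>
> > > > >> >>>>>> (Purely to illustrate the idea -- a toy sketch, not a concrete
> > > > >> >>>>>> proposal: hash each file name onto a ring of disks, so that adding or
> > > > >> >>>>>> removing a disk only moves the files that hashed to it.)
> > > > >> >>>>>>
> > > > >> >>>>>>   import java.util.SortedMap;
> > > > >> >>>>>>   import java.util.TreeMap;
> > > > >> >>>>>>
> > > > >> >>>>>>   // Toy sketch of file-to-disk placement via consistent hashing.
> > > > >> >>>>>>   class DiskRing {
> > > > >> >>>>>>       private final TreeMap<Integer, String> ring = new TreeMap<>();
> > > > >> >>>>>>
> > > > >> >>>>>>       DiskRing(String[] diskPaths, int virtualNodesPerDisk) {
> > > > >> >>>>>>           for (String disk : diskPaths)
> > > > >> >>>>>>               for (int i = 0; i < virtualNodesPerDisk; i++)
> > > > >> >>>>>>                   ring.put((disk + "#" + i).hashCode(), disk);
> > > > >> >>>>>>       }
> > > > >> >>>>>>
> > > > >> >>>>>>       // Pick the first disk clockwise from the file's position on the ring.
> > > > >> >>>>>>       String diskFor(String fileName) {
> > > > >> >>>>>>           SortedMap<Integer, String> tail = ring.tailMap(fileName.hashCode());
> > > > >> >>>>>>           return tail.isEmpty() ? ring.firstEntry().getValue()
> > > > >> >>>>>>                                 : tail.get(tail.firstKey());
> > > > >> >>>>>>       }
> > > > >> >>>>>>   }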
> > > > >> >>>>>>
> > > > >> >>>>>
> > > > >> >>>>> Are you suggesting that we should distribute logs, or log segments,
> > > > >> >>>>> across the disks of brokers? I am not sure I fully understand this
> > > > >> >>>>> approach. My gut feeling is that this would be a drastic solution that
> > > > >> >>>>> would require non-trivial design.
> > > > >> prefer
> > > > >> >> not
> > > > >> >>>>> to discuss this in detail in this thread unless you believe
> > it is
> > > > >> >>>> strictly
> > > > >> >>>>> superior to the design in this KIP in terms of solving our
> > use-case.
> > > > >> >>>>>
> > > > >> >>>>>
> > > > >> >>>>>> best,
> > > > >> >>>>>> Colin
> > > > >> >>>>>>
> > > > >> >>>>>>>
> > > > >> >>>>>>> Thanks,
> > > > >> >>>>>>> Dong
> > > > >> >>>>>>>
> > > > >> >>>>>>> On Wed, Jan 25, 2017 at 11:34 AM, Colin McCabe <
> > > > >> cmcc...@apache.org>
> > > > >> >>>>>>> wrote:
> > > > >> >>>>>>>
> > > > >> >>>>>>>> Hi Dong,
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> Thanks for the writeup!  It's very interesting.
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> I apologize in advance if this has been discussed
> > somewhere else.
> > > > >> >>>>> But
> > > > >> >>>>>> I
> > > > >> >>>>>>>> am curious if you have considered the solution of running
> > > > >> multiple
> > > > >> >>>>>>>> brokers per node.  Clearly there is a memory overhead with
> > this
> > > > >> >>>>>> solution
> > > > >> >>>>>>>> because of the fixed cost of starting multiple JVMs.
> > However,
> > > > >> >>>>> running
> > > > >> >>>>>>>> multiple JVMs would help avoid scalability bottlenecks.
> > You
> > > > >> could
> > > > >> >>>>>>>> probably push more RPCs per second, for example.  A garbage
> > > > >> >>>>> collection
> > > > >> >>>>>>>> in one broker would not affect the others.  It would be
> > > > >> interesting
> > > > >> >>>>> to
> > > > >> >>>>>>>> see this considered in the "alternate designs" section, even if you
> > > > >> >>>>>>>> end up deciding it's not the way to go.
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> best,
> > > > >> >>>>>>>> Colin
> > > > >> >>>>>>>>
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> On Thu, Jan 12, 2017, at 10:46, Dong Lin wrote:
> > > > >> >>>>>>>>> Hi all,
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> We created KIP-112: Handle disk failure for JBOD. Please
> > find
> > > > >> the
> > > > >> >>>>> KIP
> > > > >> >>>>>>>>> wiki
> > > > >> >>>>>>>>> in the link https://cwiki.apache.org/confl
> > > > >> >>>> uence/display/KAFKA/KIP-
> > > > >> >>>>>>>>> 112%3A+Handle+disk+failure+for+JBOD.
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> This KIP is related to KIP-113
> > > > >> >>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > >> >>>>>>>> 113%3A+Support+replicas+movement+between+log+directories>:
> > > > >> >>>>>>>>> Support replicas movement between log directories. They
> > are
> > > > >> >>>> needed
> > > > >> >>>>> in
> > > > >> >>>>>>>>> order
> > > > >> >>>>>>>>> to support JBOD in Kafka. Please help review the KIP. You
> > > > >> >>>> feedback
> > > > >> >>>>> is
> > > > >> >>>>>>>>> appreciated!
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> Thanks,
> > > > >> >>>>>>>>> Dong
> > > > >> >>>>>>>>
> > > > >> >>>>>>
> > > > >> >>>>>
> > > > >> >>>>
> > > > >> >>
> > > > >> >>
> > > > >>
> > > > >>
> > > > >
> >
