Yes, good catch. Thank you, James!

Best,
Stanislav

On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wushuja...@gmail.com> wrote:

> Can you update the KIP to say what the default is for max.uncleanable.partitions?
>
> -James
>
> Sent from my iPhone
>
> > On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <stanis...@confluent.io> wrote:
> >
> > Hey group,
> >
> > I am planning on starting a voting thread tomorrow. Please do reply if you feel
> > there is anything left to discuss.
> >
> > Best,
> > Stanislav
> >
> > On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:
> >
> >> Hey, Ray
> >>
> >> Thanks for pointing that out, it's fixed now
> >>
> >> Best,
> >> Stanislav
> >>
> >>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchi...@apache.org> wrote:
> >>>
> >>> Thanks. Can you fix the link in the "KIPs under discussion" table on the main KIP landing page
> >>> <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>?
> >>>
> >>> I tried, but the Wiki won't let me.
> >>>
> >>> -Ray
> >>>
> >>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> >>>> Hey guys,
> >>>>
> >>>> @Colin - good point. I added some sentences mentioning recent improvements in the
> >>>> introductory section.
> >>>>
> >>>> *Disk Failure* - I tend to agree with what Colin said - once a disk fails, you don't want to
> >>>> work with it again. As such, I've changed my mind and believe that we should mark the LogDir
> >>>> (assuming it's a disk) as offline on the first `IOException` encountered. This is the
> >>>> LogCleaner's current behavior. We shouldn't change that.
> >>>>
> >>>> *Respawning Threads* - I believe we should never re-spawn a thread. The correct approach in
> >>>> my mind is to either have it stay dead or never let it die in the first place.
> >>>>
> >>>> *Uncleanable-partition-names metric* - Colin is right, this metric is unneeded. Users can
> >>>> monitor the `uncleanable-partitions-count` metric and inspect logs.
> >>>>
> >>>> Hey Ray,
> >>>>
> >>>>> 2) I'm 100% with James in agreement with setting up the LogCleaner to
> >>>>> skip over problematic partitions instead of dying.
> >>>>
> >>>> I think we can do this for every exception that isn't `IOException`. This will future-proof
> >>>> us against bugs in the system and potential other errors. Protecting yourself against
> >>>> unexpected failures is always a good thing in my mind, but I also think that protecting
> >>>> yourself against bugs in the software is sort of clunky. What does everybody think about this?
> >>>>
> >>>>> 4) The only improvement I can think of is that if such an
> >>>>> error occurs, then have the option (configuration setting?) to create a
> >>>>> <log_segment>.skip file (or something similar).
> >>>>
> >>>> This is a good suggestion. Have others also seen corruption be generally tied to the same segment?
> >>>>
> >>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhru...@confluent.io> wrote:
> >>>>
> >>>>> For the cleaner thread specifically, I do not think respawning will help at all because we
> >>>>> are more than likely to run into the same issue again, which would end up crashing the
> >>>>> cleaner. Retrying makes sense for transient errors or when you believe some part of the
> >>>>> system could have healed itself, both of which I think are not true for the log cleaner.
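
For anyone who prefers pseudo-code, the "skip the problematic partition instead of dying" idea
could look roughly like the sketch below. Every name here is made up for illustration; it is not
the actual LogCleaner code nor the exact mechanism the KIP will specify.

    import java.io.IOException

    // Sketch only: illustrative names, not the real LogCleaner internals.
    object SkipInsteadOfDieSketch {
      final case class TopicPartition(topic: String, partition: Int)

      @volatile private var uncleanable = Set.empty[TopicPartition]

      def cleanAll(candidates: Seq[TopicPartition], clean: TopicPartition => Unit): Unit =
        candidates.filterNot(uncleanable).foreach { tp =>
          try clean(tp)
          catch {
            case e: IOException =>
              throw e // disk-level error: keep today's behavior (the log dir goes offline)
            case e: Exception =>
              // likely a logic bug confined to this partition: remember it,
              // log it, and let the cleaner thread keep running
              uncleanable += tp
              println(s"Marking $tp uncleanable after: ${e.getMessage}")
          }
        }
    }

The point is just that only `IOException` keeps today's fail-the-log-dir semantics, while any
other exception is contained to the one partition that triggered it.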
> >>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndg...@gmail.com> wrote:
> >>>>>
> >>>>>> <<<respawning threads is likely to make things worse, by putting you in an infinite loop
> >>>>>> which consumes resources and fires off continuous log messages.
> >>>>>>
> >>>>>> Hi Colin. In case it could be relevant, one way to mitigate this effect is to implement a
> >>>>>> backoff mechanism (if a second respawn is to occur then wait for 1 minute before doing it;
> >>>>>> then if a third respawn is to occur wait for 2 minutes before doing it; then 4 minutes,
> >>>>>> 8 minutes, etc. up to some max wait time).
> >>>>>>
> >>>>>> I have no opinion on whether respawn is appropriate or not in this context, but a
> >>>>>> mitigation like the increasing backoff described above may be relevant in weighing the
> >>>>>> pros and cons.
> >>>>>>
> >>>>>> Ron
> >>>>>>
> >>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmcc...@apache.org> wrote:
> >>>>>>
> >>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> >>>>>>>> Hi Stanislav! Thanks for this KIP!
> >>>>>>>>
> >>>>>>>> I agree that it would be good if the LogCleaner were more tolerant of errors. Currently,
> >>>>>>>> as you said, once it dies, it stays dead.
> >>>>>>>>
> >>>>>>>> Things are better now than they used to be. We have the metric
> >>>>>>>> kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
> >>>>>>>> which we can use to tell us if the threads are dead. And as of 1.1.0, we have KIP-226,
> >>>>>>>> which allows you to restart the log cleaner thread, without requiring a broker restart.
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> >>>>>>>> I've only read about this, I haven't personally tried it.
> >>>>>>>
> >>>>>>> Thanks for pointing this out, James! Stanislav, we should probably add a sentence or two
> >>>>>>> mentioning the KIP-226 changes somewhere in the KIP. Maybe in the intro section?
> >>>>>>>
> >>>>>>> I think it's clear that requiring the users to manually restart the log cleaner is not a
> >>>>>>> very good solution. But it's good to know that it's a possibility on some older releases.
> >>>>>>>
> >>>>>>>> Some comments:
> >>>>>>>> * I like the idea of having the log cleaner continue to clean as many partitions as it
> >>>>>>>> can, skipping over the problematic ones if possible.
> >>>>>>>>
> >>>>>>>> * If the log cleaner thread dies, I think it should automatically be revived. Your KIP
> >>>>>>>> attempts to do that by catching exceptions during execution, but I think we should go
> >>>>>>>> all the way and make sure that a new one gets created, if the thread ever dies.
> >>>>>>>
> >>>>>>> This is inconsistent with the way the rest of Kafka works. We don't automatically
> >>>>>>> re-create other threads in the broker if they terminate. In general, if there is a
> >>>>>>> serious bug in the code, respawning threads is likely to make things worse, by putting
> >>>>>>> you in an infinite loop which consumes resources and fires off continuous log messages.
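
If we ever did go the respawn route, the increasing backoff Ron describes above could be sketched
roughly like this (purely illustrative; the KIP does not propose respawning):

    // Sketch of an increasing respawn backoff: no wait before the first respawn,
    // then 1 min before the second, 2 min, 4 min, ... doubling up to a cap.
    object RespawnBackoffSketch {
      val initialBackoffMs = 60 * 1000L    // 1 minute
      val maxBackoffMs = 30 * 60 * 1000L   // arbitrary cap for the example

      def backoffMs(previousRespawns: Int): Long =
        if (previousRespawns == 0) 0L
        else math.min(initialBackoffMs << math.min(previousRespawns - 1, 30), maxBackoffMs)

      def main(args: Array[String]): Unit =
        (0 to 5).foreach(n => println(s"respawn #${n + 1}: wait ${backoffMs(n) / 1000}s"))
    }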
> >>>>>>>> * It might be worth trying to re-clean the uncleanable partitions. I've seen cases
> >>>>>>>> where an uncleanable partition later became cleanable. I unfortunately don't remember
> >>>>>>>> how that happened, but I remember being surprised when I discovered it. It might have
> >>>>>>>> been something like a follower was uncleanable but after a leader election happened,
> >>>>>>>> the log truncated and it was then cleanable again. I'm not sure.
> >>>>>>>
> >>>>>>> James, I disagree. We had this behavior in the Hadoop Distributed File System (HDFS) and
> >>>>>>> it was a constant source of user problems.
> >>>>>>>
> >>>>>>> What would happen is disks would just go bad over time. The DataNode would notice this
> >>>>>>> and take them offline. But then, due to some "optimistic" code, the DataNode would
> >>>>>>> periodically try to re-add them to the system. Then one of two things would happen: the
> >>>>>>> disk would just fail immediately again, or it would appear to work and then fail after a
> >>>>>>> short amount of time.
> >>>>>>>
> >>>>>>> The way the disk failed was normally having an I/O request take a really long time and
> >>>>>>> time out. So a bunch of request handler threads would basically slam into a brick wall
> >>>>>>> when they tried to access the bad disk, slowing the DataNode to a crawl. It was even
> >>>>>>> worse in the second scenario, if the disk appeared to work for a while, but then failed.
> >>>>>>> Any data that had been written on that DataNode to that disk would be lost, and we would
> >>>>>>> need to re-replicate it.
> >>>>>>>
> >>>>>>> Disks aren't biological systems-- they don't heal over time. Once they're bad, they stay
> >>>>>>> bad. The log cleaner needs to be robust against cases where the disk really is failing,
> >>>>>>> and really is returning bad data or timing out.
> >>>>>>>
> >>>>>>>> * For your metrics, can you spell out the full metric in JMX-style format, such as:
> >>>>>>>> kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count value=4
> >>>>>>>>
> >>>>>>>> * For "uncleanable-partitions": topic-partition names can be very long. I think the
> >>>>>>>> current max size is 210 characters (or maybe 240-ish?). Having "uncleanable-partitions"
> >>>>>>>> be a list could make for a very large metric. Also, having the metric come out as a csv
> >>>>>>>> might be difficult to work with for monitoring systems. If we *did* want the topic names
> >>>>>>>> to be accessible, what do you think of having
> >>>>>>>> kafka.log:type=LogCleanerManager,topic=topic1,partition=2
> >>>>>>>> I'm not sure if LogCleanerManager is the right type, but my example was that the topic
> >>>>>>>> and partition can be tags in the metric. That will allow monitoring systems to more
> >>>>>>>> easily slice and dice the metric. I'm not sure what the attribute for that metric would
> >>>>>>>> be. Maybe something like "uncleaned bytes" for that topic-partition? Or
> >>>>>>>> time-since-last-clean? Or maybe even just "Value=1".
> >>>>>>>
> >>>>>>> I haven't thought about this that hard, but do we really need the uncleanable topic names
> >>>>>>> to be accessible through a metric? It seems like the admin should notice that uncleanable
> >>>>>>> partitions are present, and then check the logs?
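
To make the tag idea James describes concrete, here is a plain-JMX sketch of what such an
ObjectName could look like. The broker registers its metrics through its own metrics layer, and
the "UncleanedBytes" attribute is invented purely for illustration:

    import java.lang.management.ManagementFactory
    import javax.management.ObjectName

    // Standard-MBean gauge with topic/partition encoded as ObjectName tags,
    // so monitoring systems can slice and dice without parsing a csv attribute.
    trait UncleanedBytesMBean { def getUncleanedBytes: Long }

    class UncleanedBytes extends UncleanedBytesMBean {
      @volatile var bytes: Long = 0L
      override def getUncleanedBytes: Long = bytes
    }

    object TaggedMetricSketch {
      def main(args: Array[String]): Unit = {
        val mbs = ManagementFactory.getPlatformMBeanServer
        val name = new ObjectName("kafka.log:type=LogCleanerManager,topic=topic1,partition=2")
        mbs.registerMBean(new UncleanedBytes, name)
        println(mbs.getAttribute(name, "UncleanedBytes"))
      }
    }

The attribute could just as well be time-since-last-clean or a constant 1, as James mentions; the
interesting part is the topic/partition tags rather than a single long csv value.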
> >>>>>>>> * About `max.uncleanable.partitions`, you said that this likely indicates that the disk
> >>>>>>>> is having problems. I'm not sure that is the case. For the 4 JIRAs that you mentioned
> >>>>>>>> about log cleaner problems, all of them are partition-level scenarios that happened
> >>>>>>>> during normal operation. None of them were indicative of disk problems.
> >>>>>>>
> >>>>>>> I don't think this is a meaningful comparison. In general, we don't accept JIRAs for hard
> >>>>>>> disk problems that happen on a particular cluster. If someone opened a JIRA that said "my
> >>>>>>> hard disk is having problems" we could close that as "not a Kafka bug." This doesn't
> >>>>>>> prove that disk problems don't happen, but just that JIRA isn't the right place for them.
> >>>>>>>
> >>>>>>> I do agree that the log cleaner has had a significant number of logic bugs, and that we
> >>>>>>> need to be careful to limit their impact. That's one reason why I think that a threshold
> >>>>>>> of "number of uncleanable logs" is a good idea, rather than just failing after one
> >>>>>>> IOException. In all the cases I've seen where a user hit a logic bug in the log cleaner,
> >>>>>>> it was just one partition that had the issue. We also should increase test coverage for
> >>>>>>> the log cleaner.
> >>>>>>>
> >>>>>>>> * About marking disks as offline when exceeding a certain threshold, that actually
> >>>>>>>> increases the blast radius of log compaction failures. Currently, the uncleaned
> >>>>>>>> partitions are still readable and writable. Taking the disks offline would impact
> >>>>>>>> availability of the uncleanable partitions, as well as impact all other partitions that
> >>>>>>>> are on the disk.
> >>>>>>>
> >>>>>>> In general, when we encounter I/O errors, we take the disk partition offline. This is
> >>>>>>> spelled out in KIP-112
> >>>>>>> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD):
> >>>>>>>
> >>>>>>>> - Broker assumes a log directory to be good after it starts, and mark log directory as
> >>>>>>>> bad once there is IOException when broker attempts to access (i.e. read or write) the
> >>>>>>>> log directory.
> >>>>>>>> - Broker will be offline if all log directories are bad.
> >>>>>>>> - Broker will stop serving replicas in any bad log directory. New replicas will only be
> >>>>>>>> created on good log directory.
> >>>>>>>
> >>>>>>> The behavior Stanislav is proposing for the log cleaner is actually more optimistic than
> >>>>>>> what we do for regular broker I/O, since we will tolerate multiple IOExceptions, not just
> >>>>>>> one. But it's generally consistent. Ignoring errors is not. In any case, if you want to
> >>>>>>> tolerate an unlimited number of I/O errors, you can just set the threshold to an infinite
> >>>>>>> value (although I think that would be a bad idea).
> >>>>>>>
> >>>>>>> best,
> >>>>>>> Colin
> >>>>>>>
> >>>>>>>> -James
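
To illustrate how such a threshold could stay consistent with KIP-112, the cleaner could count
uncleanable partitions per log directory and hand the directory to the existing offline-log-dir
handling once the count passes max.uncleanable.partitions. A rough sketch (the default of 10 is a
placeholder, not the value the KIP will document):

    import scala.collection.mutable

    // Sketch only: count uncleanable partitions per log dir and mark the
    // whole dir offline once a threshold is crossed (KIP-112 style).
    object UncleanableDirTrackerSketch {
      val maxUncleanablePartitions = 10 // placeholder default

      private val uncleanable = mutable.Map.empty[String, mutable.Set[String]]

      def markUncleanable(logDir: String, topicPartition: String): Unit = synchronized {
        val partitions = uncleanable.getOrElseUpdate(logDir, mutable.Set.empty[String])
        partitions += topicPartition
        if (partitions.size > maxUncleanablePartitions)
          markLogDirOffline(logDir) // past the threshold, the disk itself is suspect
      }

      private def markLogDirOffline(logDir: String): Unit =
        println(s"marking log dir $logDir offline") // stand-in for the real offline-dir handling
    }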
> >>>>>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <stanis...@confluent.io> wrote:
> >>>>>>>>>
> >>>>>>>>> I renamed the KIP and that changed the link. Sorry about that. Here is the new link:
> >>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
> >>>>>>>>>
> >>>>>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hey group,
> >>>>>>>>>>
> >>>>>>>>>> I created a new KIP about making log compaction more fault-tolerant. Please give it a
> >>>>>>>>>> look here and please share what you think, especially in regards to the points in the
> >>>>>>>>>> "Needs Discussion" paragraph.
> >>>>>>>>>>
> >>>>>>>>>> KIP: KIP-346
> >>>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Best,
> >>>>>>>>>> Stanislav
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Best,
> >>>>>>>>> Stanislav
> >>
> >> --
> >> Best,
> >> Stanislav
> >
> > --
> > Best,
> > Stanislav
>
--
Best,
Stanislav