Perhaps we could start with max.uncleanable.partitions and then implement max.uncleanable.partitions.per.logdir in a follow-up change if it turns out to be necessary? What do you think?
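For illustration, the difference between the two knobs would be roughly the following (a sketch only, not actual Kafka code and not what the KIP specifies; the class, the method names, and the bookkeeping are made up):

```scala
import scala.collection.mutable

// Sketch only. It just illustrates how a broker-wide "max.uncleanable.partitions"
// differs from a hypothetical "max.uncleanable.partitions.per.logdir".
class UncleanablePartitions(maxUncleanablePartitions: Int) {
  // log directory -> topic-partitions the cleaner has given up on
  private val byLogDir = mutable.Map.empty[String, mutable.Set[String]]

  def markUncleanable(logDir: String, topicPartition: String): Unit = synchronized {
    byLogDir.getOrElseUpdate(logDir, mutable.Set.empty) += topicPartition
  }

  // "max.uncleanable.partitions": a single count across the whole broker
  def exceededBrokerWide: Boolean = synchronized {
    byLogDir.values.map(_.size).sum > maxUncleanablePartitions
  }

  // "max.uncleanable.partitions.per.logdir": the same limit applied to each log
  // directory independently, so only the affected directory would be acted on
  def logDirsOverLimit: Seq[String] = synchronized {
    byLogDir.collect { case (dir, tps) if tps.size > maxUncleanablePartitions => dir }.toSeq
  }
}
```

Starting with only the broker-wide knob keeps the config surface small, and a per-logdir variant could reuse the same bookkeeping later if we need it.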
regards,
Colin

On Sat, Aug 4, 2018, at 10:53, Stanislav Kozlovski wrote:
> Hey Ray,
>
> Thanks for the explanation. In regards to the configuration property - I'm not sure. As long as it has sufficient documentation, I find "max.uncleanable.partitions" to be okay. If we were to add the distinction explicitly, maybe it should be `max.uncleanable.partitions.per.logdir`?
>
> On Thu, Aug 2, 2018 at 7:32 PM Ray Chiang <rchi...@apache.org> wrote:
>
> > One more thing occurred to me. Should the configuration property be named "max.uncleanable.partitions.per.disk" instead?
> >
> > -Ray
> >
> > On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:
> > > Yes, good catch. Thank you, James!
> > >
> > > Best,
> > > Stanislav
> > >
> > > On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wushuja...@gmail.com> wrote:
> > >
> > >> Can you update the KIP to say what the default is for max.uncleanable.partitions?
> > >>
> > >> -James
> > >>
> > >> Sent from my iPhone
> > >>
> > >>> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <stanis...@confluent.io> wrote:
> > >>>
> > >>> Hey group,
> > >>>
> > >>> I am planning on starting a voting thread tomorrow. Please do reply if you feel there is anything left to discuss.
> > >>>
> > >>> Best,
> > >>> Stanislav
> > >>>
> > >>> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:
> > >>>
> > >>>> Hey, Ray
> > >>>>
> > >>>> Thanks for pointing that out, it's fixed now.
> > >>>>
> > >>>> Best,
> > >>>> Stanislav
> > >>>>
> > >>>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchi...@apache.org> wrote:
> > >>>>>
> > >>>>> Thanks. Can you fix the link in the "KIPs under discussion" table on the main KIP landing page <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>? I tried, but the Wiki won't let me.
> > >>>>>
> > >>>>> -Ray
> > >>>>>
> > >>>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> > >>>>>>
> > >>>>>> Hey guys,
> > >>>>>>
> > >>>>>> @Colin - good point. I added some sentences mentioning recent improvements in the introductory section.
> > >>>>>>
> > >>>>>> *Disk Failure* - I tend to agree with what Colin said - once a disk fails, you don't want to work with it again. As such, I've changed my mind and believe that we should mark the LogDir (assume it's a disk) as offline on the first `IOException` encountered. This is the LogCleaner's current behavior. We shouldn't change that.
> > >>>>>>
> > >>>>>> *Respawning Threads* - I believe we should never re-spawn a thread. The correct approach in my mind is to either have it stay dead or never let it die in the first place.
> > >>>>>>
> > >>>>>> *Uncleanable-partition-names metric* - Colin is right, this metric is unneeded. Users can monitor the `uncleanable-partitions-count` metric and inspect logs.
> > >>>>>>
> > >>>>>> Hey Ray,
> > >>>>>>
> > >>>>>>> 2) I'm 100% with James in agreement with setting up the LogCleaner to skip over problematic partitions instead of dying.
> > >>>>>>
> > >>>>>> I think we can do this for every exception that isn't `IOException`. This will future-proof us against bugs in the system and potential other errors.
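To make the "skip the problematic partition instead of dying" idea concrete, the control flow would be roughly the following (a sketch only, not the actual kafka.log.LogCleaner code; the helper functions here are placeholders):

```scala
import java.io.IOException

// Illustrative only: catch everything except IOException per partition, mark that
// partition uncleanable, and keep the cleaner thread alive for the rest.
def cleanFilthiestLogs(partitions: Seq[String],
                       clean: String => Unit,
                       markUncleanable: String => Unit): Unit = {
  partitions.foreach { tp =>
    try {
      clean(tp)
    } catch {
      case e: IOException =>
        // Disk-level problem: rethrow so the whole log directory can be marked
        // offline, which is the existing behavior we don't want to change.
        throw e
      case _: Exception =>
        // Likely a logic bug or corruption confined to this partition: record it
        // and move on, so one bad partition doesn't stop compaction broker-wide.
        markUncleanable(tp)
    }
  }
}
```

Note that `case _: Exception` does not catch `Error`s such as OutOfMemoryError, so those would still propagate, which seems like the right trade-off.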
> > >>>>>> Protecting yourself against unexpected failures is always a good thing in my mind, but I also think that protecting yourself against bugs in the software is sort of clunky. What does everybody think about this?
> > >>>>>>
> > >>>>>>> 4) The only improvement I can think of is that if such an error occurs, then have the option (configuration setting?) to create a <log_segment>.skip file (or something similar).
> > >>>>>>
> > >>>>>> This is a good suggestion. Have others also seen corruption be generally tied to the same segment?
> > >>>>>>
> > >>>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhru...@confluent.io> wrote:
> > >>>>>>
> > >>>>>>> For the cleaner thread specifically, I do not think respawning will help at all because we are more than likely to run into the same issue again which would end up crashing the cleaner. Retrying makes sense for transient errors or when you believe some part of the system could have healed itself, both of which I think are not true for the log cleaner.
> > >>>>>>>
> > >>>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndg...@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>> <<<respawning threads is likely to make things worse, by putting you in an infinite loop which consumes resources and fires off continuous log messages.
> > >>>>>>>>
> > >>>>>>>> Hi Colin. In case it could be relevant, one way to mitigate this effect is to implement a backoff mechanism (if a second respawn is to occur then wait for 1 minute before doing it; then if a third respawn is to occur wait for 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to some max wait time).
> > >>>>>>>>
> > >>>>>>>> I have no opinion on whether respawn is appropriate or not in this context, but a mitigation like the increasing backoff described above may be relevant in weighing the pros and cons.
> > >>>>>>>>
> > >>>>>>>> Ron
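Ron's increasing-backoff idea above would look roughly like this (a sketch only; KIP-346 does not actually propose respawning the thread, and the base and max delays are just the numbers from his example):

```scala
import java.util.concurrent.TimeUnit

// Sketch of an increasing-backoff respawn policy: wait 1 min before the second
// spawn, 2 min before the third, then 4, 8, ... capped at maxBackoffMs.
// startThread is assumed to create and start a fresh cleaner thread.
def respawnWithBackoff(startThread: () => Thread,
                       baseBackoffMs: Long = TimeUnit.MINUTES.toMillis(1),
                       maxBackoffMs: Long = TimeUnit.MINUTES.toMillis(30)): Unit = {
  var attempt = 0
  while (true) {                  // a real implementation would also need a shutdown condition
    val t = startThread()
    t.join()                      // returns when the cleaner thread dies
    val backoffMs = math.min(baseBackoffMs << math.min(attempt, 20), maxBackoffMs)
    attempt += 1
    Thread.sleep(backoffMs)       // wait longer after each successive death
  }
}
```

Whether respawning is appropriate at all is the open question here; the backoff only caps how much resource churn and log spam a crash loop can generate.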
> > >>>>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmcc...@apache.org> wrote:
> > >>>>>>>>
> > >>>>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> > >>>>>>>>>> Hi Stanislav! Thanks for this KIP!
> > >>>>>>>>>>
> > >>>>>>>>>> I agree that it would be good if the LogCleaner were more tolerant of errors. Currently, as you said, once it dies, it stays dead.
> > >>>>>>>>>>
> > >>>>>>>>>> Things are better now than they used to be. We have the metric kafka.log:type=LogCleanerManager,name=time-since-last-run-ms which we can use to tell us if the threads are dead. And as of 1.1.0, we have KIP-226, which allows you to restart the log cleaner thread, without requiring a broker restart.
> > >>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> > >>>>>>>>>> I've only read about this, I haven't personally tried it.
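As a side note, the metric James mentions can already be polled over JMX to alert on a dead cleaner. A minimal sketch, assuming remote JMX is enabled on the broker; the host, port, and 15-minute alert threshold are placeholders, and the attribute name `Value` is assumed, as with other Kafka Yammer gauges:

```scala
import javax.management.ObjectName
import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

// Poll time-since-last-run-ms and flag a cleaner that hasn't run recently.
object CheckLogCleaner {
  def main(args: Array[String]): Unit = {
    val url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi")
    val connector = JMXConnectorFactory.connect(url)
    try {
      val mbsc = connector.getMBeanServerConnection
      val name = new ObjectName("kafka.log:type=LogCleanerManager,name=time-since-last-run-ms")
      val sinceLastRunMs = mbsc.getAttribute(name, "Value").toString.toLong
      if (sinceLastRunMs > 15 * 60 * 1000L)  // arbitrary threshold for the example
        println(s"Log cleaner may be dead: last run $sinceLastRunMs ms ago")
    } finally {
      connector.close()
    }
  }
}
```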
> > >>>>>>>>> Thanks for pointing this out, James! Stanislav, we should probably add a sentence or two mentioning the KIP-226 changes somewhere in the KIP. Maybe in the intro section?
> > >>>>>>>>>
> > >>>>>>>>> I think it's clear that requiring the users to manually restart the log cleaner is not a very good solution. But it's good to know that it's a possibility on some older releases.
> > >>>>>>>>>
> > >>>>>>>>>> Some comments:
> > >>>>>>>>>> * I like the idea of having the log cleaner continue to clean as many partitions as it can, skipping over the problematic ones if possible.
> > >>>>>>>>>> * If the log cleaner thread dies, I think it should automatically be revived. Your KIP attempts to do that by catching exceptions during execution, but I think we should go all the way and make sure that a new one gets created, if the thread ever dies.
> > >>>>>>>>>
> > >>>>>>>>> This is inconsistent with the way the rest of Kafka works. We don't automatically re-create other threads in the broker if they terminate. In general, if there is a serious bug in the code, respawning threads is likely to make things worse, by putting you in an infinite loop which consumes resources and fires off continuous log messages.
> > >>>>>>>>>
> > >>>>>>>>>> * It might be worth trying to re-clean the uncleanable partitions. I've seen cases where an uncleanable partition later became cleanable. I unfortunately don't remember how that happened, but I remember being surprised when I discovered it. It might have been something like a follower was uncleanable but after a leader election happened, the log truncated and it was then cleanable again. I'm not sure.
> > >>>>>>>>>
> > >>>>>>>>> James, I disagree. We had this behavior in the Hadoop Distributed File System (HDFS) and it was a constant source of user problems.
> > >>>>>>>>>
> > >>>>>>>>> What would happen is disks would just go bad over time. The DataNode would notice this and take them offline. But then, due to some "optimistic" code, the DataNode would periodically try to re-add them to the system. Then one of two things would happen: the disk would just fail immediately again, or it would appear to work and then fail after a short amount of time.
> > >>>>>>>>>
> > >>>>>>>>> The way the disk failed was normally having an I/O request take a really long time and time out. So a bunch of request handler threads would basically slam into a brick wall when they tried to access the bad disk, slowing the DataNode to a crawl. It was even worse in the second scenario, if the disk appeared to work for a while, but then failed. Any data that had been written on that DataNode to that disk would be lost, and we would need to re-replicate it.
> > >>>>>>>>>
> > >>>>>>>>> Disks aren't biological systems-- they don't heal over time. Once they're bad, they stay bad. The log cleaner needs to be robust against cases where the disk really is failing, and really is returning bad data or timing out.
> > >>>>>>>>>
> > >>>>>>>>>> * For your metrics, can you spell out the full metric in JMX-style format, such as:
> > >>>>>>>>>> kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count value=4
> > >>>>>>>>>>
> > >>>>>>>>>> * For "uncleanable-partitions": topic-partition names can be very long. I think the current max size is 210 characters (or maybe 240-ish?). Having "uncleanable-partitions" be a list could make for a very large metric. Also, having the metric come out as a csv might be difficult to work with for monitoring systems. If we *did* want the topic names to be accessible, what do you think of having something like kafka.log:type=LogCleanerManager,topic=topic1,partition=2? I'm not sure if LogCleanerManager is the right type, but my example was that the topic and partition can be tags in the metric. That will allow monitoring systems to more easily slice and dice the metric. I'm not sure what the attribute for that metric would be. Maybe something like "uncleaned bytes" for that topic-partition? Or time-since-last-clean? Or maybe even just "Value=1".
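For what it's worth, a tagged per-partition gauge along the lines James describes could be registered with the Yammer metrics library the broker uses, roughly like this (illustrative only; the metric name, the way the tags are folded into the MBean name, and the constant value of 1 are assumptions, not something the KIP specifies):

```scala
import com.yammer.metrics.Metrics
import com.yammer.metrics.core.{Gauge, MetricName}

// Sketch of registering one gauge per uncleanable topic-partition, with the topic
// and partition encoded as tags in the MBean name rather than as a CSV attribute.
def registerUncleanableGauge(topic: String, partition: Int): Unit = {
  val mbeanName =
    s"kafka.log:type=LogCleanerManager,name=uncleanable-partition,topic=$topic,partition=$partition"
  val metricName = new MetricName("kafka.log", "LogCleanerManager", "uncleanable-partition",
    s"$topic.$partition", mbeanName)
  Metrics.defaultRegistry().newGauge(metricName, new Gauge[Int] {
    def value: Int = 1  // "Value=1" per the suggestion above; could be uncleaned bytes instead
  })
}
```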
> > >>>>>>>>> I haven't thought about this that hard, but do we really need the uncleanable topic names to be accessible through a metric? It seems like the admin should notice that uncleanable partitions are present, and then check the logs?
> > >>>>>>>>>
> > >>>>>>>>>> * About `max.uncleanable.partitions`, you said that this likely indicates that the disk is having problems. I'm not sure that is the case. For the 4 JIRAs that you mentioned about log cleaner problems, all of them are partition-level scenarios that happened during normal operation. None of them were indicative of disk problems.
> > >>>>>>>>>
> > >>>>>>>>> I don't think this is a meaningful comparison. In general, we don't accept JIRAs for hard disk problems that happen on a particular cluster. If someone opened a JIRA that said "my hard disk is having problems" we could close that as "not a Kafka bug." This doesn't prove that disk problems don't happen, but just that JIRA isn't the right place for them.
> > >>>>>>>>>
> > >>>>>>>>> I do agree that the log cleaner has had a significant number of logic bugs, and that we need to be careful to limit their impact. That's one reason why I think that a threshold of "number of uncleanable logs" is a good idea, rather than just failing after one IOException. In all the cases I've seen where a user hit a logic bug in the log cleaner, it was just one partition that had the issue. We also should increase test coverage for the log cleaner.
> > >>>>>>>>>
> > >>>>>>>>>> * About marking disks as offline when exceeding a certain threshold, that actually increases the blast radius of log compaction failures. Currently, the uncleaned partitions are still readable and writable. Taking the disks offline would impact availability of the uncleanable partitions, as well as impact all other partitions that are on the disk.
> > >>>>>>>>>
> > >>>>>>>>> In general, when we encounter I/O errors, we take the disk partition offline. This is spelled out in KIP-112 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD):
> > >>>>>>>>>
> > >>>>>>>>>> - Broker assumes a log directory to be good after it starts, and mark log directory as bad once there is IOException when broker attempts to access (i.e. read or write) the log directory.
> > >>>>>>>>>> - Broker will be offline if all log directories are bad.
> > >>>>>>>>>> - Broker will stop serving replicas in any bad log directory. New replicas will only be created on good log directory.
> > >>>>>>>>>
> > >>>>>>>>> The behavior Stanislav is proposing for the log cleaner is actually more optimistic than what we do for regular broker I/O, since we will tolerate multiple IOExceptions, not just one. But it's generally consistent. Ignoring errors is not. In any case, if you want to tolerate an unlimited number of I/O errors, you can just set the threshold to an infinite value (although I think that would be a bad idea).
> > >>>>>>>>>
> > >>>>>>>>> best,
> > >>>>>>>>> Colin
> > >>>>>>>>>
> > >>>>>>>>>> -James
> > >>>>>>>>>>
> > >>>>>>>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <stanis...@confluent.io> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> I renamed the KIP and that changed the link. Sorry about that. Here is the new link:
> > >>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Hey group,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I created a new KIP about making log compaction more fault-tolerant. Please give it a look here and please share what you think, especially in regards to the points in the "Needs Discussion" paragraph.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> KIP: KIP-346 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> --
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Stanislav
> > >>>>>>>>>>>
> > >>>>>>>>>>> --
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Stanislav
> > >>>>
> > >>>> --
> > >>>> Best,
> > >>>> Stanislav
> > >>>
> > >>> --
> > >>> Best,
> > >>> Stanislav
>
> --
> Best,
> Stanislav