Hey Ray,

Thanks for the explanation. Regarding the configuration property - I'm not sure. As long as it is documented well enough, I find "max.uncleanable.partitions" to be okay. If we wanted to make the distinction explicit, maybe it should be `max.uncleanable.partitions.per.logdir`?
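To make the naming question concrete, this is roughly how it would look in server.properties (the property is still only a proposal, and the value shown is purely an illustrative default, not something settled in the KIP):

    # Tracked independently for each log directory (i.e. each disk in a JBOD setup).
    # Property name and default value are illustrative only.
    max.uncleanable.partitions.per.logdir=10

The per-logdir suffix would also make it obvious that the limit applies to each log directory separately rather than across the whole broker.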
On Thu, Aug 2, 2018 at 7:32 PM Ray Chiang <rchi...@apache.org> wrote:

> One more thing occurred to me. Should the configuration property be
> named "max.uncleanable.partitions.per.disk" instead?
>
> -Ray

On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:

> Yes, good catch. Thank you, James!
>
> Best,
> Stanislav

On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wushuja...@gmail.com> wrote:

> Can you update the KIP to say what the default is for
> max.uncleanable.partitions?
>
> -James
>
> Sent from my iPhone

On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <stanis...@confluent.io> wrote:

> Hey group,
>
> I am planning on starting a voting thread tomorrow. Please do reply if
> you feel there is anything left to discuss.
>
> Best,
> Stanislav

On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:

> Hey, Ray
>
> Thanks for pointing that out, it's fixed now.
>
> Best,
> Stanislav

On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchi...@apache.org> wrote:

> Thanks. Can you fix the link in the "KIPs under discussion" table on
> the main KIP landing page
> <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>?
> I tried, but the Wiki won't let me.
>
> -Ray

On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:

> Hey guys,
>
> @Colin - good point. I added some sentences mentioning recent
> improvements in the introductory section.
>
> *Disk Failure* - I tend to agree with what Colin said - once a disk
> fails, you don't want to work with it again. As such, I've changed my
> mind and believe that we should mark the LogDir (assume it's a disk) as
> offline on the first `IOException` encountered. This is the LogCleaner's
> current behavior. We shouldn't change that.
>
> *Respawning Threads* - I believe we should never re-spawn a thread. The
> correct approach in my mind is to either have it stay dead or never let
> it die in the first place.
>
> *Uncleanable-partition-names metric* - Colin is right, this metric is
> unneeded. Users can monitor the `uncleanable-partitions-count` metric
> and inspect the logs.
>
> Hey Ray,
>
> > 2) I'm 100% with James in agreement with setting up the LogCleaner to
> > skip over problematic partitions instead of dying.
>
> I think we can do this for every exception that isn't `IOException`.
> This will future-proof us against bugs in the system and potential other
> errors. Protecting yourself against unexpected failures is always a good
> thing in my mind, but I also think that protecting yourself against bugs
> in the software is sort of clunky. What does everybody think about this?
>
> > 4) The only improvement I can think of is that if such an error
> > occurs, then have the option (configuration setting?) to create a
> > <log_segment>.skip file (or something similar).
>
> This is a good suggestion. Have others also seen corruption be generally
> tied to the same segment?
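For anyone following along, the skip-on-everything-except-IOException behaviour discussed above boils down to a loop like the one below. It is only a simplified sketch with made-up class and method names, not the actual LogCleaner code:

    import java.io.IOException;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch only: an IOException still propagates (so the log dir can be marked
    // offline on the first one), while any other exception quarantines just the
    // offending partition and the cleaner keeps going.
    public class CleanerLoopSketch {
        private final Set<String> uncleanablePartitions = ConcurrentHashMap.newKeySet();

        public interface PartitionCleaner {
            void clean(String topicPartition) throws IOException;
        }

        public void cleanAll(Iterable<String> partitions, PartitionCleaner cleaner) throws IOException {
            for (String tp : partitions) {
                if (uncleanablePartitions.contains(tp)) {
                    continue; // already known to be problematic, skip it
                }
                try {
                    cleaner.clean(tp);
                } catch (IOException e) {
                    throw e; // disk-level problem, handled by taking the log dir offline
                } catch (Exception e) {
                    uncleanablePartitions.add(tp); // logic bug or bad data, skip only this partition
                }
            }
        }
    }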
On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhru...@confluent.io> wrote:

> For the cleaner thread specifically, I do not think respawning will help
> at all because we are more than likely to run into the same issue again,
> which would end up crashing the cleaner. Retrying makes sense for
> transient errors or when you believe some part of the system could have
> healed itself, both of which I think are not true for the log cleaner.

On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndg...@gmail.com> wrote:

> <<< respawning threads is likely to make things worse, by putting you in
> an infinite loop which consumes resources and fires off continuous log
> messages.
>
> Hi Colin. In case it could be relevant, one way to mitigate this effect
> is to implement a backoff mechanism (if a second respawn is to occur then
> wait for 1 minute before doing it; then if a third respawn is to occur
> wait for 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to
> some max wait time).
>
> I have no opinion on whether respawn is appropriate or not in this
> context, but a mitigation like the increasing backoff described above may
> be relevant in weighing the pros and cons.
>
> Ron
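The capped backoff Ron describes is easy to sketch. The class and numbers below are made up purely for illustration; nothing like this exists in the broker today:

    import java.util.concurrent.TimeUnit;

    // Illustrative sketch of a capped exponential backoff between respawn attempts.
    public class RespawnBackoff {
        private static final long MAX_BACKOFF_MS = TimeUnit.MINUTES.toMillis(32);

        private long nextBackoffMs = TimeUnit.MINUTES.toMillis(1);

        // Wait before the next respawn, doubling the delay each time
        // (1 min, 2 min, 4 min, ...) up to the cap.
        public void awaitNextRespawn() throws InterruptedException {
            Thread.sleep(nextBackoffMs);
            nextBackoffMs = Math.min(nextBackoffMs * 2, MAX_BACKOFF_MS);
        }
    }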
On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmcc...@apache.org> wrote:

> > On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> >
> > Hi Stanislav! Thanks for this KIP!
> >
> > I agree that it would be good if the LogCleaner were more tolerant of
> > errors. Currently, as you said, once it dies, it stays dead.
> >
> > Things are better now than they used to be. We have the metric
> > kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
> > which we can use to tell us if the threads are dead. And as of 1.1.0,
> > we have KIP-226, which allows you to restart the log cleaner thread
> > without requiring a broker restart:
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> > I've only read about this, I haven't personally tried it.
>
> Thanks for pointing this out, James! Stanislav, we should probably add a
> sentence or two mentioning the KIP-226 changes somewhere in the KIP.
> Maybe in the intro section?
>
> I think it's clear that requiring the users to manually restart the log
> cleaner is not a very good solution. But it's good to know that it's a
> possibility on some older releases.
>
> > Some comments:
> >
> > * I like the idea of having the log cleaner continue to clean as many
> > partitions as it can, skipping over the problematic ones if possible.
> >
> > * If the log cleaner thread dies, I think it should automatically be
> > revived. Your KIP attempts to do that by catching exceptions during
> > execution, but I think we should go all the way and make sure that a
> > new one gets created if the thread ever dies.
>
> This is inconsistent with the way the rest of Kafka works. We don't
> automatically re-create other threads in the broker if they terminate.
> In general, if there is a serious bug in the code, respawning threads is
> likely to make things worse, by putting you in an infinite loop which
> consumes resources and fires off continuous log messages.
>
> > * It might be worth trying to re-clean the uncleanable partitions.
> > I've seen cases where an uncleanable partition later became cleanable.
> > I unfortunately don't remember how that happened, but I remember being
> > surprised when I discovered it. It might have been something like a
> > follower was uncleanable but after a leader election happened, the log
> > truncated and it was then cleanable again. I'm not sure.
>
> James, I disagree. We had this behavior in the Hadoop Distributed File
> System (HDFS) and it was a constant source of user problems.
>
> What would happen is disks would just go bad over time. The DataNode
> would notice this and take them offline. But then, due to some
> "optimistic" code, the DataNode would periodically try to re-add them to
> the system. Then one of two things would happen: the disk would just
> fail immediately again, or it would appear to work and then fail after a
> short amount of time.
>
> The way the disk failed was normally having an I/O request take a really
> long time and time out. So a bunch of request handler threads would
> basically slam into a brick wall when they tried to access the bad disk,
> slowing the DataNode to a crawl. It was even worse in the second
> scenario: if the disk appeared to work for a while but then failed, any
> data that had been written on that DataNode to that disk would be lost,
> and we would need to re-replicate it.
>
> Disks aren't biological systems -- they don't heal over time. Once
> they're bad, they stay bad. The log cleaner needs to be robust against
> cases where the disk really is failing, and really is returning bad data
> or timing out.
>
> > * For your metrics, can you spell out the full metric in JMX-style
> > format, such as:
> > kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
> > value=4
> >
> > * For "uncleanable-partitions": topic-partition names can be very
> > long. I think the current max size is 210 characters (or maybe
> > 240-ish?). Having "uncleanable-partitions" be a list could make for a
> > very large metric. Also, having the metric come out as a CSV might be
> > difficult to work with for monitoring systems. If we *did* want the
> > topic names to be accessible, what do you think of having
> > kafka.log:type=LogCleanerManager,topic=topic1,partition=2
> > I'm not sure if LogCleanerManager is the right type, but my example
> > was that the topic and partition can be tags in the metric. That will
> > allow monitoring systems to more easily slice and dice the metric. I'm
> > not sure what the attribute for that metric would be. Maybe something
> > like "uncleaned bytes" for that topic-partition? Or
> > time-since-last-clean? Or maybe even just "Value=1".
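For reference, a tag-style name like the one James describes would come out roughly as follows. This uses plain JMX for illustration, not Kafka's actual metrics plumbing, and the attribute exposed under this name is deliberately left open, as James notes:

    import javax.management.MalformedObjectNameException;
    import javax.management.ObjectName;

    // Illustrative only: builds a JMX ObjectName where topic and partition are tags,
    // e.g. kafka.log:type=LogCleanerManager,topic=topic1,partition=2
    public class UncleanablePartitionMetricName {
        public static ObjectName of(String topic, int partition) throws MalformedObjectNameException {
            return new ObjectName(String.format(
                    "kafka.log:type=LogCleanerManager,topic=%s,partition=%d", topic, partition));
        }
    }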
> I haven't thought about this that hard, but do we really need the
> uncleanable topic names to be accessible through a metric? It seems like
> the admin should notice that uncleanable partitions are present, and
> then check the logs?
>
> > * About `max.uncleanable.partitions`, you said that this likely
> > indicates that the disk is having problems. I'm not sure that is the
> > case. For the 4 JIRAs that you mentioned about log cleaner problems,
> > all of them are partition-level scenarios that happened during normal
> > operation. None of them were indicative of disk problems.
>
> I don't think this is a meaningful comparison. In general, we don't
> accept JIRAs for hard disk problems that happen on a particular cluster.
> If someone opened a JIRA that said "my hard disk is having problems" we
> could close that as "not a Kafka bug." This doesn't prove that disk
> problems don't happen, just that JIRA isn't the right place for them.
>
> I do agree that the log cleaner has had a significant number of logic
> bugs, and that we need to be careful to limit their impact. That's one
> reason why I think that a threshold on the number of uncleanable logs is
> a good idea, rather than just failing after one IOException. In all the
> cases I've seen where a user hit a logic bug in the log cleaner, it was
> just one partition that had the issue. We also should increase test
> coverage for the log cleaner.
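As a rough sketch of what such a threshold amounts to (hypothetical names; the actual accounting proposed in the KIP may differ):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch: count uncleanable partitions per log directory and only
    // signal that the directory should be taken offline once a threshold is crossed.
    public class UncleanableThresholdSketch {
        private final int maxUncleanablePartitionsPerLogDir;
        private final Map<String, Integer> uncleanableCountByLogDir = new ConcurrentHashMap<>();

        public UncleanableThresholdSketch(int maxUncleanablePartitionsPerLogDir) {
            this.maxUncleanablePartitionsPerLogDir = maxUncleanablePartitionsPerLogDir;
        }

        // Returns true if the log directory has now exceeded the configured limit.
        public boolean recordUncleanablePartition(String logDir) {
            int count = uncleanableCountByLogDir.merge(logDir, 1, Integer::sum);
            return count > maxUncleanablePartitionsPerLogDir;
        }
    }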
> > * About marking disks as offline when exceeding a certain threshold,
> > that actually increases the blast radius of log compaction failures.
> > Currently, the uncleaned partitions are still readable and writable.
> > Taking the disks offline would impact availability of the uncleanable
> > partitions, as well as impact all other partitions that are on the
> > disk.
>
> In general, when we encounter I/O errors, we take the disk partition
> offline. This is spelled out in KIP-112
> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD):
>
> > - Broker assumes a log directory to be good after it starts, and mark
> >   log directory as bad once there is IOException when broker attempts
> >   to access (i.e. read or write) the log directory.
> > - Broker will be offline if all log directories are bad.
> > - Broker will stop serving replicas in any bad log directory. New
> >   replicas will only be created on good log directory.
>
> The behavior Stanislav is proposing for the log cleaner is actually more
> optimistic than what we do for regular broker I/O, since we will
> tolerate multiple IOExceptions, not just one. But it's generally
> consistent. Ignoring errors is not. In any case, if you want to tolerate
> an unlimited number of I/O errors, you can just set the threshold to an
> infinite value (although I think that would be a bad idea).
>
> best,
> Colin
>
> > -James

On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <stanis...@confluent.io> wrote:

> I renamed the KIP and that changed the link. Sorry about that. Here is
> the new link:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error

On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:

> Hey group,
>
> I created a new KIP about making log compaction more fault-tolerant.
> Please give it a look here and please share what you think, especially
> in regards to the points in the "Needs Discussion" paragraph.
>
> KIP: KIP-346
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure
>
> --
> Best,
> Stanislav

--
Best,
Stanislav