Hey guys,

@Colin - good point. I added some sentences mentioning the recent improvements in the introductory section.
*Disk Failure* - I tend to agree with what Colin said - once a disk fails, you don't want to work with it again. As such, I've changed my mind and believe that we should mark the LogDir (assume it's a disk) as offline on the first `IOException` encountered. This is the LogCleaner's current behavior, and we shouldn't change that.

*Respawning Threads* - I believe we should never re-spawn a thread. The correct approach in my mind is to either have it stay dead or never let it die in the first place.

*Uncleanable-partition-names metric* - Colin is right, this metric is unneeded. Users can monitor the `uncleanable-partitions-count` metric and inspect the logs.

Hey Ray,

> 2) I'm 100% with James in agreement with setting up the LogCleaner to skip over problematic partitions instead of dying.

I think we can do this for every exception that isn't an `IOException` (see the rough sketch below). This will future-proof us against bugs in the system and other potential errors. Protecting yourself against unexpected failures is always a good thing in my mind, but I also think that protecting yourself against bugs in the software is sort of clunky. What does everybody think about this?

> 4) The only improvement I can think of is that if such an error occurs, then have the option (configuration setting?) to create a <log_segment>.skip file (or something similar).

This is a good suggestion. Have others also seen corruption generally tied to the same segment?
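To make the above concrete, here is a rough sketch of the error handling I have in mind. It is illustrative only - `cleanLog`, `failLogDir` and `markUncleanable` are placeholder names for this example, not the actual LogCleaner or LogDirFailureChannel internals:

```scala
import java.io.IOException

// Illustrative sketch only: the helpers passed in are hypothetical stand-ins,
// not the real LogCleaner / LogDirFailureChannel code.
object CleanerErrorHandlingSketch {
  final case class TopicPartition(topic: String, partition: Int)

  def cleanOnePartition(tp: TopicPartition,
                        cleanLog: TopicPartition => Unit,
                        failLogDir: TopicPartition => Unit,
                        markUncleanable: TopicPartition => Unit): Unit = {
    try {
      cleanLog(tp)
    } catch {
      case _: IOException =>
        // Disk-level failure: keep the current behavior and mark the whole
        // log directory offline on the first IOException.
        failLogDir(tp)
      case _: Exception =>
        // Anything else (e.g. a logic bug or a corrupt segment): mark only this
        // partition as uncleanable and keep cleaning the remaining partitions
        // instead of letting the cleaner thread die.
        markUncleanable(tp)
    }
  }
}
```

The intent is simply that an `IOException` keeps its current "fail the log dir" semantics, while any other error stays contained to the single partition.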
On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhru...@confluent.io> wrote:

> For the cleaner thread specifically, I do not think respawning will help at all because we are more than likely to run into the same issue again, which would end up crashing the cleaner. Retrying makes sense for transient errors or when you believe some part of the system could have healed itself, both of which I think are not true for the log cleaner.
>
> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndg...@gmail.com> wrote:
>
> > <<<respawning threads is likely to make things worse, by putting you in an infinite loop which consumes resources and fires off continuous log messages.
> >
> > Hi Colin. In case it could be relevant, one way to mitigate this effect is to implement a backoff mechanism (if a second respawn is to occur, then wait for 1 minute before doing it; if a third respawn is to occur, wait for 2 minutes before doing it; then 4 minutes, 8 minutes, etc., up to some max wait time).
> >
> > I have no opinion on whether respawn is appropriate or not in this context, but a mitigation like the increasing backoff described above may be relevant in weighing the pros and cons.
> >
> > Ron
> >
> > On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmcc...@apache.org> wrote:
> >
> > > On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> > > > Hi Stanislav! Thanks for this KIP!
> > > >
> > > > I agree that it would be good if the LogCleaner were more tolerant of errors. Currently, as you said, once it dies, it stays dead.
> > > >
> > > > Things are better now than they used to be. We have the metric kafka.log:type=LogCleanerManager,name=time-since-last-run-ms which we can use to tell us if the threads are dead. And as of 1.1.0, we have KIP-226, which allows you to restart the log cleaner thread without requiring a broker restart:
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> > > > I've only read about this; I haven't personally tried it.
> > >
> > > Thanks for pointing this out, James! Stanislav, we should probably add a sentence or two mentioning the KIP-226 changes somewhere in the KIP. Maybe in the intro section?
> > >
> > > I think it's clear that requiring users to manually restart the log cleaner is not a very good solution. But it's good to know that it's a possibility on some older releases.
> > >
> > > > Some comments:
> > > >
> > > > * I like the idea of having the log cleaner continue to clean as many partitions as it can, skipping over the problematic ones if possible.
> > > >
> > > > * If the log cleaner thread dies, I think it should automatically be revived. Your KIP attempts to do that by catching exceptions during execution, but I think we should go all the way and make sure that a new one gets created if the thread ever dies.
> > >
> > > This is inconsistent with the way the rest of Kafka works. We don't automatically re-create other threads in the broker if they terminate. In general, if there is a serious bug in the code, respawning threads is likely to make things worse, by putting you in an infinite loop which consumes resources and fires off continuous log messages.
> > >
> > > > * It might be worth trying to re-clean the uncleanable partitions. I've seen cases where an uncleanable partition later became cleanable. I unfortunately don't remember how that happened, but I remember being surprised when I discovered it. It might have been something like a follower was uncleanable, but after a leader election happened, the log truncated and it was then cleanable again. I'm not sure.
> > >
> > > James, I disagree. We had this behavior in the Hadoop Distributed File System (HDFS) and it was a constant source of user problems.
> > >
> > > What would happen is disks would just go bad over time. The DataNode would notice this and take them offline. But then, due to some "optimistic" code, the DataNode would periodically try to re-add them to the system. Then one of two things would happen: the disk would just fail immediately again, or it would appear to work and then fail after a short amount of time.
> > >
> > > The way the disk failed was normally having an I/O request take a really long time and time out. So a bunch of request handler threads would basically slam into a brick wall when they tried to access the bad disk, slowing the DataNode to a crawl. It was even worse in the second scenario: if the disk appeared to work for a while but then failed, any data that had been written to that disk on that DataNode would be lost, and we would need to re-replicate it.
> > >
> > > Disks aren't biological systems -- they don't heal over time. Once they're bad, they stay bad. The log cleaner needs to be robust against cases where the disk really is failing, and really is returning bad data or timing out.
> > >
> > > > * For your metrics, can you spell out the full metric in JMX-style format, such as:
> > > > kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count value=4
> > > >
> > > > * For "uncleanable-partitions": topic-partition names can be very long. I think the current max size is 210 characters (or maybe 240-ish?). Having "uncleanable-partitions" be a list could make for a very large metric. Also, having the metric come out as a CSV might be difficult to work with for monitoring systems. If we *did* want the topic names to be accessible, what do you think of having kafka.log:type=LogCleanerManager,topic=topic1,partition=2? I'm not sure if LogCleanerManager is the right type, but my example was that the topic and partition can be tags in the metric. That will allow monitoring systems to more easily slice and dice the metric. I'm not sure what the attribute for that metric would be. Maybe something like "uncleaned bytes" for that topic-partition? Or time-since-last-clean? Or maybe even just "Value=1".
> > >
> > > I haven't thought about this that hard, but do we really need the uncleanable topic names to be accessible through a metric? It seems like the admin should notice that uncleanable partitions are present, and then check the logs?
> > >
> > > > * About `max.uncleanable.partitions`, you said that this likely indicates that the disk is having problems. I'm not sure that is the case. For the 4 JIRAs that you mentioned about log cleaner problems, all of them are partition-level scenarios that happened during normal operation. None of them were indicative of disk problems.
> > >
> > > I don't think this is a meaningful comparison. In general, we don't accept JIRAs for hard disk problems that happen on a particular cluster. If someone opened a JIRA that said "my hard disk is having problems", we could close that as "not a Kafka bug." This doesn't prove that disk problems don't happen, just that JIRA isn't the right place for them.
> > >
> > > I do agree that the log cleaner has had a significant number of logic bugs, and that we need to be careful to limit their impact. That's one reason why I think that a threshold of "number of uncleanable logs" is a good idea, rather than just failing after one IOException. In all the cases I've seen where a user hit a logic bug in the log cleaner, it was just one partition that had the issue. We also should increase test coverage for the log cleaner.
> > >
> > > > * About marking disks as offline when exceeding a certain threshold, that actually increases the blast radius of log compaction failures. Currently, the uncleaned partitions are still readable and writable. Taking the disks offline would impact availability of the uncleanable partitions, as well as impact all other partitions that are on the disk.
> > >
> > > In general, when we encounter I/O errors, we take the disk partition offline. This is spelled out in KIP-112 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD):
> > >
> > > > - Broker assumes a log directory to be good after it starts, and marks a log directory as bad once there is an IOException when the broker attempts to access (i.e. read or write) the log directory.
> > > > - Broker will be offline if all log directories are bad.
> > > > - Broker will stop serving replicas in any bad log directory. New replicas will only be created on a good log directory.
> > >
> > > The behavior Stanislav is proposing for the log cleaner is actually more optimistic than what we do for regular broker I/O, since we will tolerate multiple IOExceptions, not just one. But it's generally consistent. Ignoring errors is not. In any case, if you want to tolerate an unlimited number of I/O errors, you can just set the threshold to an infinite value (although I think that would be a bad idea).
> > >
> > > best,
> > > Colin
> > >
> > > > -James
> > > >
> > > > > On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <stanis...@confluent.io> wrote:
> > > > >
> > > > > I renamed the KIP and that changed the link. Sorry about that. Here is the new link:
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
> > > > >
> > > > > On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:
> > > > >
> > > > >> Hey group,
> > > > >>
> > > > >> I created a new KIP about making log compaction more fault-tolerant. Please give it a look here and please share what you think, especially in regards to the points in the "Needs Discussion" paragraph.
> > > > >>
> > > > >> KIP: KIP-346
> > > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure
> > > > >>
> > > > >> --
> > > > >> Best,
> > > > >> Stanislav
> > > > >
> > > > >
> > > > > --
> > > > > Best,
> > > > > Stanislav

--
Best,
Stanislav
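P.S. To make the threshold Colin mentions above a bit more concrete, here is a rough sketch of what checking `max.uncleanable.partitions` could look like. The names (`recordUncleanable`, `failLogDir`) are placeholders for illustration and not actual broker code:

```scala
import scala.collection.mutable

// Illustrative sketch only of the max.uncleanable.partitions threshold idea
// discussed in the thread above; not the KIP's actual implementation.
object UncleanableThresholdSketch {
  // Uncleanable partitions tracked per log directory.
  private val uncleanableByLogDir = mutable.Map.empty[String, mutable.Set[String]]

  def recordUncleanable(logDir: String,
                        topicPartition: String,
                        maxUncleanablePartitions: Int,
                        failLogDir: String => Unit): Unit = {
    val partitions = uncleanableByLogDir.getOrElseUpdate(logDir, mutable.Set.empty)
    partitions += topicPartition
    // Tolerate a few uncleanable partitions, but once the threshold for this
    // log directory is crossed, treat the whole directory as bad, consistent
    // with how the broker handles I/O errors elsewhere.
    if (partitions.size > maxUncleanablePartitions)
      failLogDir(logDir)
  }
}
```

Setting the threshold very high (effectively infinite) would recover the behavior where uncleanable partitions never take the log directory offline, which Colin notes above is possible but probably a bad idea.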