Perhaps we could start with max.uncleanable.partitions and then implement max.uncleanable.partitions.per.logdir in a follow-up change if it turns out to be necessary? What do you think?
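For illustration, the difference between the two knobs would be roughly the following (a sketch only, not actual Kafka code and not what the KIP specifies; the class, the method names, and the bookkeeping are made up):

```scala
import scala.collection.mutable

// Sketch only. It just illustrates how a broker-wide "max.uncleanable.partitions"
// differs from a hypothetical "max.uncleanable.partitions.per.logdir".
class UncleanablePartitions(maxUncleanablePartitions: Int) {
  // log directory -> topic-partitions the cleaner has given up on
  private val byLogDir = mutable.Map.empty[String, mutable.Set[String]]

  def markUncleanable(logDir: String, topicPartition: String): Unit = synchronized {
    byLogDir.getOrElseUpdate(logDir, mutable.Set.empty) += topicPartition
  }

  // "max.uncleanable.partitions": a single count across the whole broker
  def exceededBrokerWide: Boolean = synchronized {
    byLogDir.values.map(_.size).sum > maxUncleanablePartitions
  }

  // "max.uncleanable.partitions.per.logdir": the same limit applied to each log
  // directory independently, so only the affected directory would be acted on
  def logDirsOverLimit: Seq[String] = synchronized {
    byLogDir.collect { case (dir, tps) if tps.size > maxUncleanablePartitions => dir }.toSeq
  }
}
```

Starting with only the broker-wide knob keeps the config surface small, and a per-logdir variant could reuse the same bookkeeping later if we need it.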
regards,
Colin

On Sat, Aug 4, 2018, at 10:53, Stanislav Kozlovski wrote:
> Hey Ray,
>
> Thanks for the explanation. In regards to the configuration property - I'm not sure. As long as it has sufficient documentation, I find "max.uncleanable.partitions" to be okay. If we were to add the distinction explicitly, maybe it should be `max.uncleanable.partitions.per.logdir`?
>
> On Thu, Aug 2, 2018 at 7:32 PM Ray Chiang <rchi...@apache.org> wrote:
>
> > One more thing occurred to me. Should the configuration property be named "max.uncleanable.partitions.per.disk" instead?
> >
> > -Ray
> >
> > On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:
> > > Yes, good catch. Thank you, James!
> > >
> > > Best,
> > > Stanislav
> > >
> > > On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wushuja...@gmail.com> wrote:
> > >
> > >> Can you update the KIP to say what the default is for max.uncleanable.partitions?
> > >>
> > >> -James
> > >>
> > >> Sent from my iPhone
> > >>
> > >>> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <stanis...@confluent.io> wrote:
> > >>>
> > >>> Hey group,
> > >>>
> > >>> I am planning on starting a voting thread tomorrow. Please do reply if you feel there is anything left to discuss.
> > >>>
> > >>> Best,
> > >>> Stanislav
> > >>>
> > >>> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:
> > >>>
> > >>>> Hey, Ray
> > >>>>
> > >>>> Thanks for pointing that out, it's fixed now.
> > >>>>
> > >>>> Best,
> > >>>> Stanislav
> > >>>>
> > >>>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchi...@apache.org> wrote:
> > >>>>>
> > >>>>> Thanks. Can you fix the link in the "KIPs under discussion" table on the main KIP landing page <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>? I tried, but the Wiki won't let me.
> > >>>>>
> > >>>>> -Ray
> > >>>>>
> > >>>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> > >>>>>>
> > >>>>>> Hey guys,
> > >>>>>>
> > >>>>>> @Colin - good point. I added some sentences mentioning recent improvements in the introductory section.
> > >>>>>>
> > >>>>>> *Disk Failure* - I tend to agree with what Colin said - once a disk fails, you don't want to work with it again. As such, I've changed my mind and believe that we should mark the LogDir (assume it's a disk) as offline on the first `IOException` encountered. This is the LogCleaner's current behavior. We shouldn't change that.
> > >>>>>>
> > >>>>>> *Respawning Threads* - I believe we should never re-spawn a thread. The correct approach in my mind is to either have it stay dead or never let it die in the first place.
> > >>>>>>
> > >>>>>> *Uncleanable-partition-names metric* - Colin is right, this metric is unneeded. Users can monitor the `uncleanable-partitions-count` metric and inspect logs.
> > >>>>>>
> > >>>>>> Hey Ray,
> > >>>>>>
> > >>>>>>> 2) I'm 100% with James in agreement with setting up the LogCleaner to skip over problematic partitions instead of dying.
> > >>>>>>
> > >>>>>> I think we can do this for every exception that isn't `IOException`. This will future-proof us against bugs in the system and potential other errors.
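To make the "skip the problematic partition instead of dying" idea concrete, the control flow would be roughly the following (a sketch only, not the actual kafka.log.LogCleaner code; the helper functions here are placeholders):

```scala
import java.io.IOException

// Illustrative only: catch everything except IOException per partition, mark that
// partition uncleanable, and keep the cleaner thread alive for the rest.
def cleanFilthiestLogs(partitions: Seq[String],
                       clean: String => Unit,
                       markUncleanable: String => Unit): Unit = {
  partitions.foreach { tp =>
    try {
      clean(tp)
    } catch {
      case e: IOException =>
        // Disk-level problem: rethrow so the whole log directory can be marked
        // offline, which is the existing behavior we don't want to change.
        throw e
      case _: Exception =>
        // Likely a logic bug or corruption confined to this partition: record it
        // and move on, so one bad partition doesn't stop compaction broker-wide.
        markUncleanable(tp)
    }
  }
}
```

Note that `case _: Exception` does not catch `Error`s such as OutOfMemoryError, so those would still propagate, which seems like the right trade-off.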
> > >>>>>> Protecting yourself against unexpected failures is always a good thing in my mind, but I also think that protecting yourself against bugs in the software is sort of clunky. What does everybody think about this?
> > >>>>>>
> > >>>>>>> 4) The only improvement I can think of is that if such an error occurs, then have the option (configuration setting?) to create a <log_segment>.skip file (or something similar).
> > >>>>>>
> > >>>>>> This is a good suggestion. Have others also seen corruption be generally tied to the same segment?
> > >>>>>>
> > >>>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhru...@confluent.io> wrote:
> > >>>>>>
> > >>>>>>> For the cleaner thread specifically, I do not think respawning will help at all because we are more than likely to run into the same issue again which would end up crashing the cleaner. Retrying makes sense for transient errors or when you believe some part of the system could have healed itself, both of which I think are not true for the log cleaner.
> > >>>>>>>
> > >>>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndg...@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>> <<<respawning threads is likely to make things worse, by putting you in an infinite loop which consumes resources and fires off continuous log messages.
> > >>>>>>>>
> > >>>>>>>> Hi Colin. In case it could be relevant, one way to mitigate this effect is to implement a backoff mechanism (if a second respawn is to occur then wait for 1 minute before doing it; then if a third respawn is to occur wait for 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to some max wait time).
> > >>>>>>>>
> > >>>>>>>> I have no opinion on whether respawn is appropriate or not in this context, but a mitigation like the increasing backoff described above may be relevant in weighing the pros and cons.
> > >>>>>>>>
> > >>>>>>>> Ron
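Ron's increasing-backoff idea above would look roughly like this (a sketch only; KIP-346 does not actually propose respawning the thread, and the base and max delays are just the numbers from his example):

```scala
import java.util.concurrent.TimeUnit

// Sketch of an increasing-backoff respawn policy: wait 1 min before the second
// spawn, 2 min before the third, then 4, 8, ... capped at maxBackoffMs.
// startThread is assumed to create and start a fresh cleaner thread.
def respawnWithBackoff(startThread: () => Thread,
                       baseBackoffMs: Long = TimeUnit.MINUTES.toMillis(1),
                       maxBackoffMs: Long = TimeUnit.MINUTES.toMillis(30)): Unit = {
  var attempt = 0
  while (true) {                  // a real implementation would also need a shutdown condition
    val t = startThread()
    t.join()                      // returns when the cleaner thread dies
    val backoffMs = math.min(baseBackoffMs << math.min(attempt, 20), maxBackoffMs)
    attempt += 1
    Thread.sleep(backoffMs)       // wait longer after each successive death
  }
}
```

Whether respawning is appropriate at all is the open question here; the backoff only caps how much resource churn and log spam a crash loop can generate.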
> > >>>>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmcc...@apache.org> wrote:
> > >>>>>>>>
> > >>>>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> > >>>>>>>>>> Hi Stanislav! Thanks for this KIP!
> > >>>>>>>>>>
> > >>>>>>>>>> I agree that it would be good if the LogCleaner were more tolerant of errors. Currently, as you said, once it dies, it stays dead.
> > >>>>>>>>>>
> > >>>>>>>>>> Things are better now than they used to be. We have the metric kafka.log:type=LogCleanerManager,name=time-since-last-run-ms which we can use to tell us if the threads are dead. And as of 1.1.0, we have KIP-226, which allows you to restart the log cleaner thread, without requiring a broker restart.
> > >>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> > >>>>>>>>>> I've only read about this, I haven't personally tried it.
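As a side note, the metric James mentions can already be polled over JMX to alert on a dead cleaner. A minimal sketch, assuming remote JMX is enabled on the broker; the host, port, and 15-minute alert threshold are placeholders, and the attribute name `Value` is assumed, as with other Kafka Yammer gauges:

```scala
import javax.management.ObjectName
import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

// Poll time-since-last-run-ms and flag a cleaner that hasn't run recently.
object CheckLogCleaner {
  def main(args: Array[String]): Unit = {
    val url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi")
    val connector = JMXConnectorFactory.connect(url)
    try {
      val mbsc = connector.getMBeanServerConnection
      val name = new ObjectName("kafka.log:type=LogCleanerManager,name=time-since-last-run-ms")
      val sinceLastRunMs = mbsc.getAttribute(name, "Value").toString.toLong
      if (sinceLastRunMs > 15 * 60 * 1000L)  // arbitrary threshold for the example
        println(s"Log cleaner may be dead: last run $sinceLastRunMs ms ago")
    } finally {
      connector.close()
    }
  }
}
```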
> > >>>>>>>>> Thanks for pointing this out, James! Stanislav, we should probably add a sentence or two mentioning the KIP-226 changes somewhere in the KIP. Maybe in the intro section?
> > >>>>>>>>>
> > >>>>>>>>> I think it's clear that requiring the users to manually restart the log cleaner is not a very good solution. But it's good to know that it's a possibility on some older releases.
> > >>>>>>>>>
> > >>>>>>>>>> Some comments:
> > >>>>>>>>>> * I like the idea of having the log cleaner continue to clean as many partitions as it can, skipping over the problematic ones if possible.
> > >>>>>>>>>> * If the log cleaner thread dies, I think it should automatically be revived. Your KIP attempts to do that by catching exceptions during execution, but I think we should go all the way and make sure that a new one gets created, if the thread ever dies.
> > >>>>>>>>>
> > >>>>>>>>> This is inconsistent with the way the rest of Kafka works. We don't automatically re-create other threads in the broker if they terminate. In general, if there is a serious bug in the code, respawning threads is likely to make things worse, by putting you in an infinite loop which consumes resources and fires off continuous log messages.
> > >>>>>>>>>
> > >>>>>>>>>> * It might be worth trying to re-clean the uncleanable partitions. I've seen cases where an uncleanable partition later became cleanable. I unfortunately don't remember how that happened, but I remember being surprised when I discovered it. It might have been something like a follower was uncleanable but after a leader election happened, the log truncated and it was then cleanable again. I'm not sure.
> > >>>>>>>>>
> > >>>>>>>>> James, I disagree. We had this behavior in the Hadoop Distributed File System (HDFS) and it was a constant source of user problems.
> > >>>>>>>>>
> > >>>>>>>>> What would happen is disks would just go bad over time. The DataNode would notice this and take them offline. But then, due to some "optimistic" code, the DataNode would periodically try to re-add them to the system. Then one of two things would happen: the disk would just fail immediately again, or it would appear to work and then fail after a short amount of time.
> > >>>>>>>>>
> > >>>>>>>>> The way the disk failed was normally having an I/O request take a really long time and time out. So a bunch of request handler threads would basically slam into a brick wall when they tried to access the bad disk, slowing the DataNode to a crawl. It was even worse in the second scenario, if the disk appeared to work for a while, but then failed. Any data that had been written on that DataNode to that disk would be lost, and we would need to re-replicate it.
> > >>>>>>>>>
> > >>>>>>>>> Disks aren't biological systems-- they don't heal over time. Once they're bad, they stay bad. The log cleaner needs to be robust against cases where the disk really is failing, and really is returning bad data or timing out.
> > >>>>>>>>>
> > >>>>>>>>>> * For your metrics, can you spell out the full metric in JMX-style format, such as:
> > >>>>>>>>>> kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count value=4
> > >>>>>>>>>>
> > >>>>>>>>>> * For "uncleanable-partitions": topic-partition names can be very long. I think the current max size is 210 characters (or maybe 240-ish?). Having "uncleanable-partitions" be a list could make for a very large metric. Also, having the metric come out as a csv might be difficult to work with for monitoring systems. If we *did* want the topic names to be accessible, what do you think of having something like kafka.log:type=LogCleanerManager,topic=topic1,partition=2? I'm not sure if LogCleanerManager is the right type, but my example was that the topic and partition can be tags in the metric. That will allow monitoring systems to more easily slice and dice the metric. I'm not sure what the attribute for that metric would be. Maybe something like "uncleaned bytes" for that topic-partition? Or time-since-last-clean? Or maybe even just "Value=1".
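For what it's worth, a tagged per-partition gauge along the lines James describes could be registered with the Yammer metrics library the broker uses, roughly like this (illustrative only; the metric name, the way the tags are folded into the MBean name, and the constant value of 1 are assumptions, not something the KIP specifies):

```scala
import com.yammer.metrics.Metrics
import com.yammer.metrics.core.{Gauge, MetricName}

// Sketch of registering one gauge per uncleanable topic-partition, with the topic
// and partition encoded as tags in the MBean name rather than as a CSV attribute.
def registerUncleanableGauge(topic: String, partition: Int): Unit = {
  val mbeanName =
    s"kafka.log:type=LogCleanerManager,name=uncleanable-partition,topic=$topic,partition=$partition"
  val metricName = new MetricName("kafka.log", "LogCleanerManager", "uncleanable-partition",
    s"$topic.$partition", mbeanName)
  Metrics.defaultRegistry().newGauge(metricName, new Gauge[Int] {
    def value: Int = 1  // "Value=1" per the suggestion above; could be uncleaned bytes instead
  })
}
```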
> > >>>>>>>>> I haven't thought about this that hard, but do we really need the uncleanable topic names to be accessible through a metric? It seems like the admin should notice that uncleanable partitions are present, and then check the logs?
> > >>>>>>>>>
> > >>>>>>>>>> * About `max.uncleanable.partitions`, you said that this likely indicates that the disk is having problems. I'm not sure that is the case. For the 4 JIRAs that you mentioned about log cleaner problems, all of them are partition-level scenarios that happened during normal operation. None of them were indicative of disk problems.
> > >>>>>>>>>
> > >>>>>>>>> I don't think this is a meaningful comparison. In general, we don't accept JIRAs for hard disk problems that happen on a particular cluster. If someone opened a JIRA that said "my hard disk is having problems" we could close that as "not a Kafka bug." This doesn't prove that disk problems don't happen, but just that JIRA isn't the right place for them.
> > >>>>>>>>>
> > >>>>>>>>> I do agree that the log cleaner has had a significant number of logic bugs, and that we need to be careful to limit their impact. That's one reason why I think that a threshold of "number of uncleanable logs" is a good idea, rather than just failing after one IOException. In all the cases I've seen where a user hit a logic bug in the log cleaner, it was just one partition that had the issue. We also should increase test coverage for the log cleaner.
> > >>>>>>>>>
> > >>>>>>>>>> * About marking disks as offline when exceeding a certain threshold, that actually increases the blast radius of log compaction failures. Currently, the uncleaned partitions are still readable and writable. Taking the disks offline would impact availability of the uncleanable partitions, as well as impact all other partitions that are on the disk.
> > >>>>>>>>>
> > >>>>>>>>> In general, when we encounter I/O errors, we take the disk partition offline. This is spelled out in KIP-112 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD):
> > >>>>>>>>>
> > >>>>>>>>>> - Broker assumes a log directory to be good after it starts, and mark log directory as bad once there is IOException when broker attempts to access (i.e. read or write) the log directory.
> > >>>>>>>>>> - Broker will be offline if all log directories are bad.
> > >>>>>>>>>> - Broker will stop serving replicas in any bad log directory. New replicas will only be created on good log directory.
> > >>>>>>>>>
> > >>>>>>>>> The behavior Stanislav is proposing for the log cleaner is actually more optimistic than what we do for regular broker I/O, since we will tolerate multiple IOExceptions, not just one. But it's generally consistent. Ignoring errors is not. In any case, if you want to tolerate an unlimited number of I/O errors, you can just set the threshold to an infinite value (although I think that would be a bad idea).
> > >>>>>>>>>
> > >>>>>>>>> best,
> > >>>>>>>>> Colin
> > >>>>>>>>>
> > >>>>>>>>>> -James
> > >>>>>>>>>>
> > >>>>>>>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <stanis...@confluent.io> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> I renamed the KIP and that changed the link. Sorry about that. Here is the new link:
> > >>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Hey group,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I created a new KIP about making log compaction more fault-tolerant. Please give it a look here and please share what you think, especially in regards to the points in the "Needs Discussion" paragraph.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> KIP: KIP-346 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> --
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Stanislav
> > >>>>>>>>>>>
> > >>>>>>>>>>> --
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Stanislav
> > >>>>
> > >>>> --
> > >>>> Best,
> > >>>> Stanislav
> > >>>
> > >>> --
> > >>> Best,
> > >>> Stanislav
>
> --
> Best,
> Stanislav