Hey Ray,

Thanks for the explanation. Regarding the configuration property - I'm not sure. As long as it is documented well enough, I find "max.uncleanable.partitions" to be okay. If we wanted to make the distinction explicit, maybe it should be `max.uncleanable.partitions.per.logdir`?
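To make the naming question concrete, this is roughly how it would look in server.properties (the property is still only a proposal, and the value shown is purely an illustrative default, not something settled in the KIP):

    # Tracked independently for each log directory (i.e. each disk in a JBOD setup).
    # Property name and default value are illustrative only.
    max.uncleanable.partitions.per.logdir=10

The per-logdir suffix would also make it obvious that the limit applies to each log directory separately rather than across the whole broker.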
On Thu, Aug 2, 2018 at 7:32 PM Ray Chiang <rchi...@apache.org> wrote:

> One more thing occurred to me. Should the configuration property be
> named "max.uncleanable.partitions.per.disk" instead?
>
> -Ray

On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:

> Yes, good catch. Thank you, James!
>
> Best,
> Stanislav

On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wushuja...@gmail.com> wrote:

> Can you update the KIP to say what the default is for
> max.uncleanable.partitions?
>
> -James
>
> Sent from my iPhone

On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <stanis...@confluent.io> wrote:

> Hey group,
>
> I am planning on starting a voting thread tomorrow. Please do reply if
> you feel there is anything left to discuss.
>
> Best,
> Stanislav

On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:

> Hey, Ray
>
> Thanks for pointing that out, it's fixed now.
>
> Best,
> Stanislav

On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchi...@apache.org> wrote:

> Thanks. Can you fix the link in the "KIPs under discussion" table on
> the main KIP landing page
> <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>?
> I tried, but the Wiki won't let me.
>
> -Ray

On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:

> Hey guys,
>
> @Colin - good point. I added some sentences mentioning recent
> improvements in the introductory section.
>
> *Disk Failure* - I tend to agree with what Colin said - once a disk
> fails, you don't want to work with it again. As such, I've changed my
> mind and believe that we should mark the LogDir (assume it's a disk) as
> offline on the first `IOException` encountered. This is the LogCleaner's
> current behavior. We shouldn't change that.
>
> *Respawning Threads* - I believe we should never re-spawn a thread. The
> correct approach in my mind is to either have it stay dead or never let
> it die in the first place.
>
> *Uncleanable-partition-names metric* - Colin is right, this metric is
> unneeded. Users can monitor the `uncleanable-partitions-count` metric
> and inspect the logs.
>
> Hey Ray,
>
> > 2) I'm 100% with James in agreement with setting up the LogCleaner to
> > skip over problematic partitions instead of dying.
>
> I think we can do this for every exception that isn't `IOException`.
> This will future-proof us against bugs in the system and potential other
> errors. Protecting yourself against unexpected failures is always a good
> thing in my mind, but I also think that protecting yourself against bugs
> in the software is sort of clunky. What does everybody think about this?
>
> > 4) The only improvement I can think of is that if such an error
> > occurs, then have the option (configuration setting?) to create a
> > <log_segment>.skip file (or something similar).
>
> This is a good suggestion. Have others also seen corruption be generally
> tied to the same segment?
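For anyone following along, the skip-on-everything-except-IOException behaviour discussed above boils down to a loop like the one below. It is only a simplified sketch with made-up class and method names, not the actual LogCleaner code:

    import java.io.IOException;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch only: an IOException still propagates (so the log dir can be marked
    // offline on the first one), while any other exception quarantines just the
    // offending partition and the cleaner keeps going.
    public class CleanerLoopSketch {
        private final Set<String> uncleanablePartitions = ConcurrentHashMap.newKeySet();

        public interface PartitionCleaner {
            void clean(String topicPartition) throws IOException;
        }

        public void cleanAll(Iterable<String> partitions, PartitionCleaner cleaner) throws IOException {
            for (String tp : partitions) {
                if (uncleanablePartitions.contains(tp)) {
                    continue; // already known to be problematic, skip it
                }
                try {
                    cleaner.clean(tp);
                } catch (IOException e) {
                    throw e; // disk-level problem, handled by taking the log dir offline
                } catch (Exception e) {
                    uncleanablePartitions.add(tp); // logic bug or bad data, skip only this partition
                }
            }
        }
    }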
On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhru...@confluent.io> wrote:

> For the cleaner thread specifically, I do not think respawning will help
> at all because we are more than likely to run into the same issue again,
> which would end up crashing the cleaner. Retrying makes sense for
> transient errors or when you believe some part of the system could have
> healed itself, both of which I think are not true for the log cleaner.

On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndg...@gmail.com> wrote:

> <<< respawning threads is likely to make things worse, by putting you in
> an infinite loop which consumes resources and fires off continuous log
> messages.
>
> Hi Colin. In case it could be relevant, one way to mitigate this effect
> is to implement a backoff mechanism (if a second respawn is to occur then
> wait for 1 minute before doing it; then if a third respawn is to occur
> wait for 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to
> some max wait time).
>
> I have no opinion on whether respawn is appropriate or not in this
> context, but a mitigation like the increasing backoff described above may
> be relevant in weighing the pros and cons.
>
> Ron
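The capped backoff Ron describes is easy to sketch. The class and numbers below are made up purely for illustration; nothing like this exists in the broker today:

    import java.util.concurrent.TimeUnit;

    // Illustrative sketch of a capped exponential backoff between respawn attempts.
    public class RespawnBackoff {
        private static final long MAX_BACKOFF_MS = TimeUnit.MINUTES.toMillis(32);

        private long nextBackoffMs = TimeUnit.MINUTES.toMillis(1);

        // Wait before the next respawn, doubling the delay each time
        // (1 min, 2 min, 4 min, ...) up to the cap.
        public void awaitNextRespawn() throws InterruptedException {
            Thread.sleep(nextBackoffMs);
            nextBackoffMs = Math.min(nextBackoffMs * 2, MAX_BACKOFF_MS);
        }
    }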
On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmcc...@apache.org> wrote:

> > On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> >
> > Hi Stanislav! Thanks for this KIP!
> >
> > I agree that it would be good if the LogCleaner were more tolerant of
> > errors. Currently, as you said, once it dies, it stays dead.
> >
> > Things are better now than they used to be. We have the metric
> > kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
> > which we can use to tell us if the threads are dead. And as of 1.1.0,
> > we have KIP-226, which allows you to restart the log cleaner thread
> > without requiring a broker restart:
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> > I've only read about this, I haven't personally tried it.
>
> Thanks for pointing this out, James! Stanislav, we should probably add a
> sentence or two mentioning the KIP-226 changes somewhere in the KIP.
> Maybe in the intro section?
>
> I think it's clear that requiring the users to manually restart the log
> cleaner is not a very good solution. But it's good to know that it's a
> possibility on some older releases.
>
> > Some comments:
> >
> > * I like the idea of having the log cleaner continue to clean as many
> > partitions as it can, skipping over the problematic ones if possible.
> >
> > * If the log cleaner thread dies, I think it should automatically be
> > revived. Your KIP attempts to do that by catching exceptions during
> > execution, but I think we should go all the way and make sure that a
> > new one gets created if the thread ever dies.
>
> This is inconsistent with the way the rest of Kafka works. We don't
> automatically re-create other threads in the broker if they terminate.
> In general, if there is a serious bug in the code, respawning threads is
> likely to make things worse, by putting you in an infinite loop which
> consumes resources and fires off continuous log messages.
>
> > * It might be worth trying to re-clean the uncleanable partitions.
> > I've seen cases where an uncleanable partition later became cleanable.
> > I unfortunately don't remember how that happened, but I remember being
> > surprised when I discovered it. It might have been something like a
> > follower was uncleanable but after a leader election happened, the log
> > truncated and it was then cleanable again. I'm not sure.
>
> James, I disagree. We had this behavior in the Hadoop Distributed File
> System (HDFS) and it was a constant source of user problems.
>
> What would happen is disks would just go bad over time. The DataNode
> would notice this and take them offline. But then, due to some
> "optimistic" code, the DataNode would periodically try to re-add them to
> the system. Then one of two things would happen: the disk would just
> fail immediately again, or it would appear to work and then fail after a
> short amount of time.
>
> The way the disk failed was normally having an I/O request take a really
> long time and time out. So a bunch of request handler threads would
> basically slam into a brick wall when they tried to access the bad disk,
> slowing the DataNode to a crawl. It was even worse in the second
> scenario: if the disk appeared to work for a while but then failed, any
> data that had been written on that DataNode to that disk would be lost,
> and we would need to re-replicate it.
>
> Disks aren't biological systems -- they don't heal over time. Once
> they're bad, they stay bad. The log cleaner needs to be robust against
> cases where the disk really is failing, and really is returning bad data
> or timing out.
>
> > * For your metrics, can you spell out the full metric in JMX-style
> > format, such as:
> > kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
> > value=4
> >
> > * For "uncleanable-partitions": topic-partition names can be very
> > long. I think the current max size is 210 characters (or maybe
> > 240-ish?). Having "uncleanable-partitions" be a list could make for a
> > very large metric. Also, having the metric come out as a CSV might be
> > difficult to work with for monitoring systems. If we *did* want the
> > topic names to be accessible, what do you think of having
> > kafka.log:type=LogCleanerManager,topic=topic1,partition=2
> > I'm not sure if LogCleanerManager is the right type, but my example
> > was that the topic and partition can be tags in the metric. That will
> > allow monitoring systems to more easily slice and dice the metric. I'm
> > not sure what the attribute for that metric would be. Maybe something
> > like "uncleaned bytes" for that topic-partition? Or
> > time-since-last-clean? Or maybe even just "Value=1".
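For reference, a tag-style name like the one James describes would come out roughly as follows. This uses plain JMX for illustration, not Kafka's actual metrics plumbing, and the attribute exposed under this name is deliberately left open, as James notes:

    import javax.management.MalformedObjectNameException;
    import javax.management.ObjectName;

    // Illustrative only: builds a JMX ObjectName where topic and partition are tags,
    // e.g. kafka.log:type=LogCleanerManager,topic=topic1,partition=2
    public class UncleanablePartitionMetricName {
        public static ObjectName of(String topic, int partition) throws MalformedObjectNameException {
            return new ObjectName(String.format(
                    "kafka.log:type=LogCleanerManager,topic=%s,partition=%d", topic, partition));
        }
    }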
> I haven't thought about this that hard, but do we really need the
> uncleanable topic names to be accessible through a metric? It seems like
> the admin should notice that uncleanable partitions are present, and
> then check the logs?
>
> > * About `max.uncleanable.partitions`, you said that this likely
> > indicates that the disk is having problems. I'm not sure that is the
> > case. For the 4 JIRAs that you mentioned about log cleaner problems,
> > all of them are partition-level scenarios that happened during normal
> > operation. None of them were indicative of disk problems.
>
> I don't think this is a meaningful comparison. In general, we don't
> accept JIRAs for hard disk problems that happen on a particular cluster.
> If someone opened a JIRA that said "my hard disk is having problems" we
> could close that as "not a Kafka bug." This doesn't prove that disk
> problems don't happen, just that JIRA isn't the right place for them.
>
> I do agree that the log cleaner has had a significant number of logic
> bugs, and that we need to be careful to limit their impact. That's one
> reason why I think that a threshold on the number of uncleanable logs is
> a good idea, rather than just failing after one IOException. In all the
> cases I've seen where a user hit a logic bug in the log cleaner, it was
> just one partition that had the issue. We also should increase test
> coverage for the log cleaner.
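As a rough sketch of what such a threshold amounts to (hypothetical names; the actual accounting proposed in the KIP may differ):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch: count uncleanable partitions per log directory and only
    // signal that the directory should be taken offline once a threshold is crossed.
    public class UncleanableThresholdSketch {
        private final int maxUncleanablePartitionsPerLogDir;
        private final Map<String, Integer> uncleanableCountByLogDir = new ConcurrentHashMap<>();

        public UncleanableThresholdSketch(int maxUncleanablePartitionsPerLogDir) {
            this.maxUncleanablePartitionsPerLogDir = maxUncleanablePartitionsPerLogDir;
        }

        // Returns true if the log directory has now exceeded the configured limit.
        public boolean recordUncleanablePartition(String logDir) {
            int count = uncleanableCountByLogDir.merge(logDir, 1, Integer::sum);
            return count > maxUncleanablePartitionsPerLogDir;
        }
    }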
> > * About marking disks as offline when exceeding a certain threshold,
> > that actually increases the blast radius of log compaction failures.
> > Currently, the uncleaned partitions are still readable and writable.
> > Taking the disks offline would impact availability of the uncleanable
> > partitions, as well as impact all other partitions that are on the
> > disk.
>
> In general, when we encounter I/O errors, we take the disk partition
> offline. This is spelled out in KIP-112
> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD):
>
> > - Broker assumes a log directory to be good after it starts, and mark
> >   log directory as bad once there is IOException when broker attempts
> >   to access (i.e. read or write) the log directory.
> > - Broker will be offline if all log directories are bad.
> > - Broker will stop serving replicas in any bad log directory. New
> >   replicas will only be created on good log directory.
>
> The behavior Stanislav is proposing for the log cleaner is actually more
> optimistic than what we do for regular broker I/O, since we will
> tolerate multiple IOExceptions, not just one. But it's generally
> consistent. Ignoring errors is not. In any case, if you want to tolerate
> an unlimited number of I/O errors, you can just set the threshold to an
> infinite value (although I think that would be a bad idea).
>
> best,
> Colin
>
> > -James

On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <stanis...@confluent.io> wrote:

> I renamed the KIP and that changed the link. Sorry about that. Here is
> the new link:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error

On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:

> Hey group,
>
> I created a new KIP about making log compaction more fault-tolerant.
> Please give it a look here and please share what you think, especially
> in regards to the points in the "Needs Discussion" paragraph.
>
> KIP: KIP-346
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure
>
> --
> Best,
> Stanislav

--
Best,
Stanislav