Yes, good catch. Thank you, James!

Best,
Stanislav

On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wushuja...@gmail.com> wrote:

> Can you update the KIP to say what the default is for max.uncleanable.partitions?
>
> -James
>
> Sent from my iPhone
>
> > On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <stanis...@confluent.io> wrote:
> >
> > Hey group,
> >
> > I am planning on starting a voting thread tomorrow. Please do reply if you feel
> > there is anything left to discuss.
> >
> > Best,
> > Stanislav
> >
> > On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:
> >
> >> Hey, Ray
> >>
> >> Thanks for pointing that out, it's fixed now
> >>
> >> Best,
> >> Stanislav
> >>
> >>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchi...@apache.org> wrote:
> >>>
> >>> Thanks. Can you fix the link in the "KIPs under discussion" table on the main KIP landing page
> >>> <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>?
> >>>
> >>> I tried, but the Wiki won't let me.
> >>>
> >>> -Ray
> >>>
> >>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> >>>> Hey guys,
> >>>>
> >>>> @Colin - good point. I added some sentences mentioning recent improvements in the
> >>>> introductory section.
> >>>>
> >>>> *Disk Failure* - I tend to agree with what Colin said - once a disk fails, you don't want to
> >>>> work with it again. As such, I've changed my mind and believe that we should mark the LogDir
> >>>> (assuming it's a disk) as offline on the first `IOException` encountered. This is the
> >>>> LogCleaner's current behavior. We shouldn't change that.
> >>>>
> >>>> *Respawning Threads* - I believe we should never re-spawn a thread. The correct approach in
> >>>> my mind is to either have it stay dead or never let it die in the first place.
> >>>>
> >>>> *Uncleanable-partition-names metric* - Colin is right, this metric is unneeded. Users can
> >>>> monitor the `uncleanable-partitions-count` metric and inspect logs.
> >>>>
> >>>> Hey Ray,
> >>>>
> >>>>> 2) I'm 100% with James in agreement with setting up the LogCleaner to
> >>>>> skip over problematic partitions instead of dying.
> >>>>
> >>>> I think we can do this for every exception that isn't `IOException`. This will future-proof
> >>>> us against bugs in the system and potential other errors. Protecting yourself against
> >>>> unexpected failures is always a good thing in my mind, but I also think that protecting
> >>>> yourself against bugs in the software is sort of clunky. What does everybody think about this?
> >>>>
> >>>>> 4) The only improvement I can think of is that if such an
> >>>>> error occurs, then have the option (configuration setting?) to create a
> >>>>> <log_segment>.skip file (or something similar).
> >>>>
> >>>> This is a good suggestion. Have others also seen corruption be generally tied to the same segment?
> >>>>
> >>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhru...@confluent.io> wrote:
> >>>>
> >>>>> For the cleaner thread specifically, I do not think respawning will help at all because we
> >>>>> are more than likely to run into the same issue again, which would end up crashing the
> >>>>> cleaner. Retrying makes sense for transient errors or when you believe some part of the
> >>>>> system could have healed itself, both of which I think are not true for the log cleaner.
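
For anyone who prefers pseudo-code, the "skip the problematic partition instead of dying" idea
could look roughly like the sketch below. Every name here is made up for illustration; it is not
the actual LogCleaner code nor the exact mechanism the KIP will specify.

    import java.io.IOException

    // Sketch only: illustrative names, not the real LogCleaner internals.
    object SkipInsteadOfDieSketch {
      final case class TopicPartition(topic: String, partition: Int)

      @volatile private var uncleanable = Set.empty[TopicPartition]

      def cleanAll(candidates: Seq[TopicPartition], clean: TopicPartition => Unit): Unit =
        candidates.filterNot(uncleanable).foreach { tp =>
          try clean(tp)
          catch {
            case e: IOException =>
              throw e // disk-level error: keep today's behavior (the log dir goes offline)
            case e: Exception =>
              // likely a logic bug confined to this partition: remember it,
              // log it, and let the cleaner thread keep running
              uncleanable += tp
              println(s"Marking $tp uncleanable after: ${e.getMessage}")
          }
        }
    }

The point is just that only `IOException` keeps today's fail-the-log-dir semantics, while any
other exception is contained to the one partition that triggered it.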
> >>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndg...@gmail.com> wrote:
> >>>>>
> >>>>>> <<<respawning threads is likely to make things worse, by putting you in an infinite loop
> >>>>>> which consumes resources and fires off continuous log messages.
> >>>>>>
> >>>>>> Hi Colin. In case it could be relevant, one way to mitigate this effect is to implement a
> >>>>>> backoff mechanism (if a second respawn is to occur then wait for 1 minute before doing it;
> >>>>>> then if a third respawn is to occur wait for 2 minutes before doing it; then 4 minutes,
> >>>>>> 8 minutes, etc. up to some max wait time).
> >>>>>>
> >>>>>> I have no opinion on whether respawn is appropriate or not in this context, but a
> >>>>>> mitigation like the increasing backoff described above may be relevant in weighing the
> >>>>>> pros and cons.
> >>>>>>
> >>>>>> Ron
> >>>>>>
> >>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmcc...@apache.org> wrote:
> >>>>>>
> >>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> >>>>>>>> Hi Stanislav! Thanks for this KIP!
> >>>>>>>>
> >>>>>>>> I agree that it would be good if the LogCleaner were more tolerant of errors. Currently,
> >>>>>>>> as you said, once it dies, it stays dead.
> >>>>>>>>
> >>>>>>>> Things are better now than they used to be. We have the metric
> >>>>>>>> kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
> >>>>>>>> which we can use to tell us if the threads are dead. And as of 1.1.0, we have KIP-226,
> >>>>>>>> which allows you to restart the log cleaner thread, without requiring a broker restart.
> >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> >>>>>>>> I've only read about this, I haven't personally tried it.
> >>>>>>>
> >>>>>>> Thanks for pointing this out, James! Stanislav, we should probably add a sentence or two
> >>>>>>> mentioning the KIP-226 changes somewhere in the KIP. Maybe in the intro section?
> >>>>>>>
> >>>>>>> I think it's clear that requiring the users to manually restart the log cleaner is not a
> >>>>>>> very good solution. But it's good to know that it's a possibility on some older releases.
> >>>>>>>
> >>>>>>>> Some comments:
> >>>>>>>> * I like the idea of having the log cleaner continue to clean as many partitions as it
> >>>>>>>> can, skipping over the problematic ones if possible.
> >>>>>>>>
> >>>>>>>> * If the log cleaner thread dies, I think it should automatically be revived. Your KIP
> >>>>>>>> attempts to do that by catching exceptions during execution, but I think we should go
> >>>>>>>> all the way and make sure that a new one gets created, if the thread ever dies.
> >>>>>>>
> >>>>>>> This is inconsistent with the way the rest of Kafka works. We don't automatically
> >>>>>>> re-create other threads in the broker if they terminate. In general, if there is a
> >>>>>>> serious bug in the code, respawning threads is likely to make things worse, by putting
> >>>>>>> you in an infinite loop which consumes resources and fires off continuous log messages.
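
If we ever did go the respawn route, the increasing backoff Ron describes above could be sketched
roughly like this (purely illustrative; the KIP does not propose respawning):

    // Sketch of an increasing respawn backoff: no wait before the first respawn,
    // then 1 min before the second, 2 min, 4 min, ... doubling up to a cap.
    object RespawnBackoffSketch {
      val initialBackoffMs = 60 * 1000L    // 1 minute
      val maxBackoffMs = 30 * 60 * 1000L   // arbitrary cap for the example

      def backoffMs(previousRespawns: Int): Long =
        if (previousRespawns == 0) 0L
        else math.min(initialBackoffMs << math.min(previousRespawns - 1, 30), maxBackoffMs)

      def main(args: Array[String]): Unit =
        (0 to 5).foreach(n => println(s"respawn #${n + 1}: wait ${backoffMs(n) / 1000}s"))
    }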
> >>>>>>>> * It might be worth trying to re-clean the uncleanable partitions. I've seen cases
> >>>>>>>> where an uncleanable partition later became cleanable. I unfortunately don't remember
> >>>>>>>> how that happened, but I remember being surprised when I discovered it. It might have
> >>>>>>>> been something like a follower was uncleanable but after a leader election happened,
> >>>>>>>> the log truncated and it was then cleanable again. I'm not sure.
> >>>>>>>
> >>>>>>> James, I disagree. We had this behavior in the Hadoop Distributed File System (HDFS) and
> >>>>>>> it was a constant source of user problems.
> >>>>>>>
> >>>>>>> What would happen is disks would just go bad over time. The DataNode would notice this
> >>>>>>> and take them offline. But then, due to some "optimistic" code, the DataNode would
> >>>>>>> periodically try to re-add them to the system. Then one of two things would happen: the
> >>>>>>> disk would just fail immediately again, or it would appear to work and then fail after a
> >>>>>>> short amount of time.
> >>>>>>>
> >>>>>>> The way the disk failed was normally having an I/O request take a really long time and
> >>>>>>> time out. So a bunch of request handler threads would basically slam into a brick wall
> >>>>>>> when they tried to access the bad disk, slowing the DataNode to a crawl. It was even
> >>>>>>> worse in the second scenario, if the disk appeared to work for a while, but then failed.
> >>>>>>> Any data that had been written on that DataNode to that disk would be lost, and we would
> >>>>>>> need to re-replicate it.
> >>>>>>>
> >>>>>>> Disks aren't biological systems-- they don't heal over time. Once they're bad, they stay
> >>>>>>> bad. The log cleaner needs to be robust against cases where the disk really is failing,
> >>>>>>> and really is returning bad data or timing out.
> >>>>>>>
> >>>>>>>> * For your metrics, can you spell out the full metric in JMX-style format, such as:
> >>>>>>>> kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count value=4
> >>>>>>>>
> >>>>>>>> * For "uncleanable-partitions": topic-partition names can be very long. I think the
> >>>>>>>> current max size is 210 characters (or maybe 240-ish?). Having "uncleanable-partitions"
> >>>>>>>> be a list could make for a very large metric. Also, having the metric come out as a csv
> >>>>>>>> might be difficult to work with for monitoring systems. If we *did* want the topic names
> >>>>>>>> to be accessible, what do you think of having
> >>>>>>>> kafka.log:type=LogCleanerManager,topic=topic1,partition=2
> >>>>>>>> I'm not sure if LogCleanerManager is the right type, but my example was that the topic
> >>>>>>>> and partition can be tags in the metric. That will allow monitoring systems to more
> >>>>>>>> easily slice and dice the metric. I'm not sure what the attribute for that metric would
> >>>>>>>> be. Maybe something like "uncleaned bytes" for that topic-partition? Or
> >>>>>>>> time-since-last-clean? Or maybe even just "Value=1".
> >>>>>>>
> >>>>>>> I haven't thought about this that hard, but do we really need the uncleanable topic names
> >>>>>>> to be accessible through a metric? It seems like the admin should notice that uncleanable
> >>>>>>> partitions are present, and then check the logs?
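
To make the tag idea James describes concrete, here is a plain-JMX sketch of what such an
ObjectName could look like. The broker registers its metrics through its own metrics layer, and
the "UncleanedBytes" attribute is invented purely for illustration:

    import java.lang.management.ManagementFactory
    import javax.management.ObjectName

    // Standard-MBean gauge with topic/partition encoded as ObjectName tags,
    // so monitoring systems can slice and dice without parsing a csv attribute.
    trait UncleanedBytesMBean { def getUncleanedBytes: Long }

    class UncleanedBytes extends UncleanedBytesMBean {
      @volatile var bytes: Long = 0L
      override def getUncleanedBytes: Long = bytes
    }

    object TaggedMetricSketch {
      def main(args: Array[String]): Unit = {
        val mbs = ManagementFactory.getPlatformMBeanServer
        val name = new ObjectName("kafka.log:type=LogCleanerManager,topic=topic1,partition=2")
        mbs.registerMBean(new UncleanedBytes, name)
        println(mbs.getAttribute(name, "UncleanedBytes"))
      }
    }

The attribute could just as well be time-since-last-clean or a constant 1, as James mentions; the
interesting part is the topic/partition tags rather than a single long csv value.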
> >>>>>>>> * About `max.uncleanable.partitions`, you said that this likely indicates that the disk
> >>>>>>>> is having problems. I'm not sure that is the case. For the 4 JIRAs that you mentioned
> >>>>>>>> about log cleaner problems, all of them are partition-level scenarios that happened
> >>>>>>>> during normal operation. None of them were indicative of disk problems.
> >>>>>>>
> >>>>>>> I don't think this is a meaningful comparison. In general, we don't accept JIRAs for hard
> >>>>>>> disk problems that happen on a particular cluster. If someone opened a JIRA that said "my
> >>>>>>> hard disk is having problems" we could close that as "not a Kafka bug." This doesn't
> >>>>>>> prove that disk problems don't happen, but just that JIRA isn't the right place for them.
> >>>>>>>
> >>>>>>> I do agree that the log cleaner has had a significant number of logic bugs, and that we
> >>>>>>> need to be careful to limit their impact. That's one reason why I think that a threshold
> >>>>>>> of "number of uncleanable logs" is a good idea, rather than just failing after one
> >>>>>>> IOException. In all the cases I've seen where a user hit a logic bug in the log cleaner,
> >>>>>>> it was just one partition that had the issue. We also should increase test coverage for
> >>>>>>> the log cleaner.
> >>>>>>>
> >>>>>>>> * About marking disks as offline when exceeding a certain threshold, that actually
> >>>>>>>> increases the blast radius of log compaction failures. Currently, the uncleaned
> >>>>>>>> partitions are still readable and writable. Taking the disks offline would impact
> >>>>>>>> availability of the uncleanable partitions, as well as impact all other partitions that
> >>>>>>>> are on the disk.
> >>>>>>>
> >>>>>>> In general, when we encounter I/O errors, we take the disk partition offline. This is
> >>>>>>> spelled out in KIP-112
> >>>>>>> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD):
> >>>>>>>
> >>>>>>>> - Broker assumes a log directory to be good after it starts, and mark log directory as
> >>>>>>>> bad once there is IOException when broker attempts to access (i.e. read or write) the
> >>>>>>>> log directory.
> >>>>>>>> - Broker will be offline if all log directories are bad.
> >>>>>>>> - Broker will stop serving replicas in any bad log directory. New replicas will only be
> >>>>>>>> created on good log directory.
> >>>>>>>
> >>>>>>> The behavior Stanislav is proposing for the log cleaner is actually more optimistic than
> >>>>>>> what we do for regular broker I/O, since we will tolerate multiple IOExceptions, not just
> >>>>>>> one. But it's generally consistent. Ignoring errors is not. In any case, if you want to
> >>>>>>> tolerate an unlimited number of I/O errors, you can just set the threshold to an infinite
> >>>>>>> value (although I think that would be a bad idea).
> >>>>>>>
> >>>>>>> best,
> >>>>>>> Colin
> >>>>>>>
> >>>>>>>> -James
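
To illustrate how such a threshold could stay consistent with KIP-112, the cleaner could count
uncleanable partitions per log directory and hand the directory to the existing offline-log-dir
handling once the count passes max.uncleanable.partitions. A rough sketch (the default of 10 is a
placeholder, not the value the KIP will document):

    import scala.collection.mutable

    // Sketch only: count uncleanable partitions per log dir and mark the
    // whole dir offline once a threshold is crossed (KIP-112 style).
    object UncleanableDirTrackerSketch {
      val maxUncleanablePartitions = 10 // placeholder default

      private val uncleanable = mutable.Map.empty[String, mutable.Set[String]]

      def markUncleanable(logDir: String, topicPartition: String): Unit = synchronized {
        val partitions = uncleanable.getOrElseUpdate(logDir, mutable.Set.empty[String])
        partitions += topicPartition
        if (partitions.size > maxUncleanablePartitions)
          markLogDirOffline(logDir) // past the threshold, the disk itself is suspect
      }

      private def markLogDirOffline(logDir: String): Unit =
        println(s"marking log dir $logDir offline") // stand-in for the real offline-dir handling
    }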
> >>>>>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <stanis...@confluent.io> wrote:
> >>>>>>>>>
> >>>>>>>>> I renamed the KIP and that changed the link. Sorry about that. Here is the new link:
> >>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
> >>>>>>>>>
> >>>>>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hey group,
> >>>>>>>>>>
> >>>>>>>>>> I created a new KIP about making log compaction more fault-tolerant. Please give it a
> >>>>>>>>>> look here and please share what you think, especially in regards to the points in the
> >>>>>>>>>> "Needs Discussion" paragraph.
> >>>>>>>>>>
> >>>>>>>>>> KIP: KIP-346
> >>>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Best,
> >>>>>>>>>> Stanislav
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Best,
> >>>>>>>>> Stanislav
> >>
> >> --
> >> Best,
> >> Stanislav
> >
> > --
> > Best,
> > Stanislav
>
--
Best,
Stanislav