Thanks for the reply, Jason and Dhruvil. Yeah, we don't need a config for the sanity check, and thus we don't need a KIP. I think we are on the same page about just skipping the sanity check of segments before the recovery offset. I will close the KIP and submit a patch for this.
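(For illustration only, a minimal sketch of the direction agreed on here. Segment, loadSegments and sanityCheck below are hypothetical stand-ins, not the actual broker code or the eventual patch.)

    // Sketch: only segments at or after the recovery point are sanity checked;
    // everything before it is opened as-is and any index problem surfaces lazily.
    object SkipSanityCheckSketch {
      final case class Segment(baseOffset: Long, indexFile: java.io.File)

      def loadSegments(segments: Seq[Segment], recoveryPoint: Long): Unit =
        segments.foreach { seg =>
          if (seg.baseOffset >= recoveryPoint)
            sanityCheck(seg) // possibly unflushed after a hard crash
          // segments below the recovery point are skipped, so startup no longer
          // pays one index read per inactive segment
        }

      def sanityCheck(seg: Segment): Unit = {
        val len = seg.indexFile.length()
        require(len % 8 == 0, s"Index file ${seg.indexFile.getAbsolutePath} is corrupt")
      }
    }
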
On Wed, Jun 27, 2018 at 10:09 AM, Dhruvil Shah <dhru...@confluent.io> wrote:

+1 to what Jason said. We need a better long-term strategy for dealing with corrupted log and index data, but the sanity checks we have do not guarantee much in this regard.

For now, we could do away with these index sanity checks in my opinion. We could handle the missing-index case at startup. I think we could have missing index files only when users are upgrading from a version that did not have a particular type of index to a version that does, or if the operator physically deleted these files. Because these are rare scenarios, having to recreate a missing index should typically not affect normal startup time.

- Dhruvil

On Wed, Jun 27, 2018 at 8:47 AM Jason Gustafson <ja...@confluent.io> wrote:

Hey Dong,

> So the main concern with the above approach is that, if for any reason the index files of an inactive segment are deleted or corrupted, the broker will halt if there is only one log directory. This is different from the existing behavior, where the broker will rebuild the index for this inactive segment before it can accept any request from consumers. Though we don't provide guarantees for segments already flushed to disk, this still seems like a change in behavior for users. Maybe we don't have to worry about this if we decide it is very rare, e.g. it happens only when there is a disk error or human error.

I think we should probably still handle the case when an index file is missing during startup? But considering how weak the sanity check is, it seems fine to skip it. Also, could we just make this change without a KIP? Adding a config to enable a wimpy sanity check seems unnecessary.

One scenario that does come up with users is actual segment corruption, which is only detected by consumers that are validating CRCs. To fix it, we have to manually delete the segments and force re-replication. It would be helpful to have a config to enable deep checking on startup for particular topics or partitions. This could also just be a separate tool though ("kafka-fsck" or something).

Thinking longer term, I think we need a more systematic approach to dealing with corruption, not just in index files, but in the segments as well. It might be nice, for example, if the consumer had a way to hint to the broker that a particular offset is corrupt. The leader might then demote itself, for example, and try to repair. Lots to think through though.

-Jason

On Wed, Jun 27, 2018 at 12:29 AM, Dong Lin <lindon...@gmail.com> wrote:

So the main concern with the above approach is that, if for any reason the index files of an inactive segment are deleted or corrupted, the broker will halt if there is only one log directory. This is different from the existing behavior, where the broker will rebuild the index for this inactive segment before it can accept any request from consumers. Though we don't provide guarantees for segments already flushed to disk, this still seems like a change in behavior for users. Maybe we don't have to worry about this if we decide it is very rare, e.g. it happens only when there is a disk error or human error.
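(A rough sketch of the missing-index handling Jason and Dhruvil describe above; openSegmentIndex and rebuildIndex are illustrative names, not the actual LogSegment API.)

    import java.io.File

    // Sketch: rebuild only when the index file is absent (upgrade from an older
    // format, or manual deletion); otherwise trust the flushed index and skip
    // the sanity check entirely.
    object MissingIndexSketch {
      def openSegmentIndex(logFile: File, offsetIndexFile: File): Unit =
        if (!offsetIndexFile.exists())
          rebuildIndex(logFile, offsetIndexFile) // rare path, paid only at startup
        // common path: load the index lazily, no check at startup

      def rebuildIndex(logFile: File, offsetIndexFile: File): Unit =
        // placeholder for scanning the log segment and rewriting the index entries
        println(s"rebuilding ${offsetIndexFile.getName} from ${logFile.getName}")
    }
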
On Wed, Jun 27, 2018 at 12:04 AM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jason,

Thanks for the comment!

Your comment reminded me to read through Jay's comments and my reply again. It seems that I probably had not captured the idea in Jay's comment that the sanity check is not part of any formal guarantee we provide. I probably should have thought about this comment more. Let me reply to both your comment and Jay's and see if I can understand you better.

Here are some clarifications:
- The KIP does not intend to optimize recovery. It aims to optimize the sanity check when there is a clean shutdown.
- The sanity check only reads the last entry of the index, not the full index.
- We have already done a data-driven investigation, though not using hprof or strace. The resulting rolling bounce time is acceptable now. If it appears to be an issue again, e.g. with more data, then we may need to revisit this with a more detailed data-driven investigation.

I agree with the following comments:
- We should optimize the default behavior instead of adding a new config.
- The sanity check of segments before the recovery offset is not part of any formal guarantee, and thus we probably can just skip it.

So we are all leaning towards skipping the sanity check of all segments before the recovery offset. This solution would be pretty straightforward to understand and implement, and I am sure it will give us all the benefits that this KIP intends to achieve. Here is only one question to double check:

If a consumer fetches from an inactive segment, the broker will just use the index of that inactive segment. If anything goes wrong, e.g. the index file is corrupted or does not exist, then the broker will just treat it as an IOException, mark the disk and the partitions on that disk offline, and respond with KafkaStorageException to the consumer. Does this sound OK? One alternative solution is to let the broker rebuild the index. But this alternative is inconsistent with the idea that the "sanity check is not part of any formal guarantee", and it may tie up all the request handler threads rebuilding indexes.

If this solution sounds right, I will update the KIP accordingly.

Thanks,
Dong
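(A hedged sketch of the fetch-path behavior Dong describes just above; markLogDirOffline and the error string are stand-ins for the real broker plumbing.)

    import java.io.IOException

    // Sketch: an index problem on an inactive segment is treated like any other
    // storage error, rather than triggering an inline index rebuild on the
    // request handler thread.
    object StorageErrorSketch {
      def readFromInactiveSegment(read: () => Array[Byte], logDir: String): Either[String, Array[Byte]] =
        try Right(read())
        catch {
          case e: IOException =>
            markLogDirOffline(logDir, e)   // take the log dir and its partitions offline
            Left("KAFKA_STORAGE_ERROR")    // the consumer sees a KafkaStorageException
        }

      def markLogDirOffline(logDir: String, cause: Throwable): Unit =
        println(s"marking $logDir offline: ${cause.getMessage}")
    }
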
On Tue, Jun 26, 2018 at 3:23 PM, Jason Gustafson <ja...@confluent.io> wrote:

Hey Dong,

Sorry for being slow to catch up to this.

I think the benefit of the sanity check seems a little dubious in the first place. We detect garbage at the end of the index file, but that's about it. Is there any reason to think that corruption is more likely to occur there, or any other reason to think this check is still beneficial for flushed data? I assume we did the check because we presumed it was cheap, but perhaps the cost is adding up as the number of partitions grows. How much does startup time improve if we skip the sanity check for data earlier than the recovery point? Does the lazy loading itself give some additional benefit beyond skipping the sanity check? As Jay mentions above, the sanity checks seem strictly speaking optional. We don't bother checking the segments themselves, for example.

Thanks,
Jason

It probably still makes sense for segments beyond the recovery point

On Wed, Mar 21, 2018 at 9:59 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jay,

Yeah, our existing sanity check only reads the last entry in the index files. I must have miscommunicated if I previously said it was reading the full index. The broker appears to be spending a lot of time just reading the last entry of the index files for every log segment. This is probably because the OS loads a chunk of data that is much larger than the entry itself from disk into the page cache. This KIP tries to make this part of the operation lazy. I guess you are suggesting that we should just make the lazy loading the default behavior?

Yes, we currently require manual intervention if the log file is corrupted, i.e. if two messages with the same offset are appended to the disk (KAFKA-6488). The sanity check on broker startup is a bit different since it deals with corruption of index files (e.g. offset index, time index and snapshot files) instead of the log data. In this case, if index files are corrupted, the broker will automatically recover by rebuilding the index files using data in the log files, without requiring manual intervention. Thus the design question is whether this should be done before the broker can become leader for any partitions -- there is a tradeoff between broker startup time and the risk of delaying user requests if the broker needs to rebuild index files when it is already leader. I prefer lazy loading to reduce broker startup time. Not sure what the feedback from the community is on this issue.

Thanks,
Dong
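(An illustrative sketch of the lazy variant Dong describes above: the last-entry read the check performs today is deferred until the index is first used, so the page-cache fault happens on first access rather than during startup. LazyLastEntry is a hypothetical name.)

    import java.io.{File, RandomAccessFile}

    // Sketch: defer the single last-entry read until first access of the index.
    final class LazyLastEntry(file: File, entrySize: Int = 8) {
      lazy val lastEntry: Array[Byte] = {
        val raf = new RandomAccessFile(file, "r")
        try {
          val buf = new Array[Byte](entrySize)
          if (raf.length() >= entrySize) {
            raf.seek(raf.length() - entrySize) // read only the final entry
            raf.readFully(buf)
          }
          buf
        } finally raf.close()
      }
    }
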
On Wed, Mar 21, 2018 at 7:36 AM, Jay Kreps <j...@confluent.io> wrote:

Hey Dong,

Makes total sense. What I'm saying is that I don't think the sanity check is part of any formal guarantee we provide. It is true that corruption of data flushed to disk is a potential problem, but I don't think the sanity check solves that; it just has a couple of heuristics to help detect certain possible instances of it, right? In general I think our assumption has been that flushed data doesn't disappear or get corrupted, and if it does you need to manually intervene. I don't think people want to configure things at this level, so what I was suggesting was understanding why the sanity check is slow and trying to avoid that rather than making it configurable. I think you mentioned it was reading the full index into memory. Based on the performance you describe this could be true, but it definitely should not be reading anything but the last entry in the index, so that would be a bug. That read also happens in sanityCheck() only in the time-based index, right? In the offset index we do the same read, but it happens in initialization. If that read is the slow thing, it might make sense to try to remove it or make it lazy in both cases. If it is some other part of the code (e.g. the size check), then that may be able to be avoided entirely (I think by the time we sanity check we already know the file size from the mapping...). That was what I meant by doing some data-driven analysis. Maybe a quick run with hprof would help determine the root cause of why sanityCheck is slow?

-Jay
On Tue, Mar 20, 2018 at 12:13 AM Dong Lin <lindon...@gmail.com> wrote:

Hey Jay,

Thanks for your comments!

Yeah, recovery is different from the sanity check. They are correlated in the sense that there may still be corrupted index files even after a clean broker shutdown, and in that case, if we delay the sanity check then we may delay the log recovery. The main goal of this KIP is to optimize the sanity-check-related work so that it does not delay the broker startup much.

The KIP mentioned that the sanity check is done using the log recovery background threads. The name "recovery" is used mainly because the number of background threads is determined by the existing config num.recovery.threads.per.data.dir. I have updated the KIP to make this less confusing.

It makes a ton of sense to optimize the broker startup time in a data-driven fashion. The current optimization was done roughly in this fashion. The broker log shows that LogManager.loadLogs() takes a long time in large clusters. I then started a broker with a cold cache and repeatedly took thread dumps to see what the broker threads were doing during LogManager.loadLogs(). Most of the threads were working on sanityCheck(), and this motivates the change in this KIP. Previously, broker shutdown time was investigated in a similar data-driven fashion and optimized with KAFKA-6172 and KAFKA-6175. It seems that the current KIP can reduce the rolling bounce time of a large cluster by 50% -- there may be room for further improvement, but maybe those improvements do not require as big a change (with the caveat described in the KIP) as suggested in this KIP.

It is not clear whether it is safe to just read the latest segment without sanity checking all previous inactive segments of a given partition if transactions are used. Otherwise we probably want to always skip the sanity check of inactive segments without introducing a new config. Maybe the developers familiar with transactions can comment on that?

Thanks,
Dong
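(For context on the "recovery" thread naming Dong mentions above, a rough sketch of per-data-dir loading work sized by num.recovery.threads.per.data.dir; the pool wiring here is illustrative, not the actual LogManager code.)

    import java.util.concurrent.Executors

    // Sketch: one fixed pool per data dir, sized by num.recovery.threads.per.data.dir,
    // running the per-log loading tasks (where sanityCheck() showed up in thread dumps).
    object LoadLogsSketch {
      def loadLogs(dataDirs: Seq[String], numRecoveryThreadsPerDataDir: Int): Unit =
        dataDirs.foreach { dataDir =>
          val pool = Executors.newFixedThreadPool(numRecoveryThreadsPerDataDir)
          try {
            val tasks = listLogDirs(dataDir).map { logDir =>
              new Runnable { def run(): Unit = println(s"loading $logDir") }
            }
            tasks.map(t => pool.submit(t)).foreach(_.get()) // wait for all logs in this dir
          } finally pool.shutdown()
        }

      def listLogDirs(dataDir: String): Seq[String] =
        Option(new java.io.File(dataDir).listFiles()).toSeq.flatten
          .filter(_.isDirectory).map(_.getPath)
    }
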
On Mon, Mar 19, 2018 at 7:21 PM, Jay Kreps <j...@confluent.io> wrote:

Optimizing startup seems really valuable, but I'm a little confused by this.

There are two different things:
1. Recovery
2. Sanity check

The terminology we're using is a bit mixed here.

Recovery means checksumming the log segments and rebuilding the index on a hard crash. This only happens on unflushed segments, which is generally just the last segment. Recovery is essential for the correctness guarantees of the log and you shouldn't disable it. It only happens on hard crash and is not a factor in graceful restart. We can likely optimize it, but that would make most sense to do in a data-driven fashion off some profiling.

However, there is also a ton of disk activity that happens during initialization (lots of checks on the file size, absolute path, etc). I think these have crept in over time with people not really realizing this code is perf sensitive and Java hiding a lot of what is and isn't a file operation. One part of this is the sanityCheck() call for the two indexes. I don't think this call reads the full index, just the last entry in the index, right? There should be no need to read the full index except during recovery (and then only for the segments being recovered). I think it would make a ton of sense to optimize this, but I don't think that optimization needs to be configurable, as this is just a helpful sanity check to detect common nonsensical things in the index files; it isn't part of the core guarantees. In general you aren't supposed to lose committed data from disk, and if you do, we may be able to fail faster but we fundamentally can't really help you. Again, I think this would make the most sense to do in a data-driven way: if you look at that code, I think it is doing crazy amounts of file operations (e.g. getAbsolutePath, file sizes, etc). I think it'd make most sense to profile startup with a cold cache on a large log directory, and do the same with an strace, to see how many redundant system calls we do per segment and what is costing us, and then cut some of this out. I suspect we could speed up our startup time quite a lot if we did that.

For example we have a bunch of calls like this:

    require(len % entrySize == 0,
            "Index file " + file.getAbsolutePath + " is corrupt, found " + len +
            " bytes which is not positive or not a multiple of 8.")

I'm pretty sure file.getAbsolutePath is a system call, and I assume that happens whether or not you fail the in-memory check?

-Jay

On Sun, Feb 25, 2018 at 10:27 PM, Dong Lin <lindon...@gmail.com> wrote:

Hi all,

I have created KIP-263: Allow broker to skip sanity check of inactive segments on broker startup. See
https://cwiki.apache.org/confluence/display/KAFKA/KIP-263%3A+Allow+broker+to+skip+sanity+check+of+inactive+segments+on+broker+startup

This KIP provides a way to significantly reduce the time to rolling bounce a Kafka cluster.

Comments are welcome!

Thanks,
Dong