Thanks for the reply, Jason and Dhruvil. Yeah, we don't need a config for the sanity check, and thus we don't need a KIP. I think we are on the same page about just skipping the sanity check of segments before the recovery offset. I will close the KIP and submit a patch for this.
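(For illustration only, a minimal sketch of the direction agreed on here. Segment, loadSegments and sanityCheck below are hypothetical stand-ins, not the actual broker code or the eventual patch.)

    // Sketch: only segments at or after the recovery point are sanity checked;
    // everything before it is opened as-is and any index problem surfaces lazily.
    object SkipSanityCheckSketch {
      final case class Segment(baseOffset: Long, indexFile: java.io.File)

      def loadSegments(segments: Seq[Segment], recoveryPoint: Long): Unit =
        segments.foreach { seg =>
          if (seg.baseOffset >= recoveryPoint)
            sanityCheck(seg) // possibly unflushed after a hard crash
          // segments below the recovery point are skipped, so startup no longer
          // pays one index read per inactive segment
        }

      def sanityCheck(seg: Segment): Unit = {
        val len = seg.indexFile.length()
        require(len % 8 == 0, s"Index file ${seg.indexFile.getAbsolutePath} is corrupt")
      }
    }
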
On Wed, Jun 27, 2018 at 10:09 AM, Dhruvil Shah <dhru...@confluent.io> wrote:

+1 to what Jason said. We need a better long-term strategy for dealing with corrupted log and index data, but the sanity checks we have do not guarantee much in this regard.

For now, we could do away with these index sanity checks in my opinion. We could handle the missing-index case at startup. I think we could have missing index files only when users are upgrading from a version that did not have a particular type of index to a version that does, or if the operator physically deleted these files. Because these are rare scenarios, having to recreate a missing index should typically not affect normal startup time.

- Dhruvil

On Wed, Jun 27, 2018 at 8:47 AM Jason Gustafson <ja...@confluent.io> wrote:

Hey Dong,

> So the main concern with the above approach is that, if for any reason the index files of an inactive segment are deleted or corrupted, the broker will halt if there is only one log directory. This is different from the existing behavior, where the broker will rebuild the index for this inactive segment before it can accept any request from consumers. Though we don't provide guarantees for segments already flushed to disk, this still seems like a change in behavior for users. Maybe we don't have to worry about this if we decide it is very rare, e.g. it happens only when there is a disk error or human error.

I think we should probably still handle the case when an index file is missing during startup? But considering how weak the sanity check is, it seems fine to skip it. Also, could we just make this change without a KIP? Adding a config to enable a wimpy sanity check seems unnecessary.

One scenario that does come up with users is actual segment corruption, which is only detected by consumers that are validating CRCs. To fix it, we have to manually delete the segments and force re-replication. It would be helpful to have a config to enable deep checking on startup for particular topics or partitions. This could also just be a separate tool though ("kafka-fsck" or something).

Thinking longer term, I think we need a more systematic approach to dealing with corruption, not just in index files, but in the segments as well. It might be nice, for example, if the consumer had a way to hint to the broker that a particular offset is corrupt. The leader might then demote itself, for example, and try to repair. Lots to think through though.

-Jason

On Wed, Jun 27, 2018 at 12:29 AM, Dong Lin <lindon...@gmail.com> wrote:

So the main concern with the above approach is that, if for any reason the index files of an inactive segment are deleted or corrupted, the broker will halt if there is only one log directory. This is different from the existing behavior, where the broker will rebuild the index for this inactive segment before it can accept any request from consumers. Though we don't provide guarantees for segments already flushed to disk, this still seems like a change in behavior for users. Maybe we don't have to worry about this if we decide it is very rare, e.g. it happens only when there is a disk error or human error.
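(A rough sketch of the missing-index handling Jason and Dhruvil describe above; openSegmentIndex and rebuildIndex are illustrative names, not the actual LogSegment API.)

    import java.io.File

    // Sketch: rebuild only when the index file is absent (upgrade from an older
    // format, or manual deletion); otherwise trust the flushed index and skip
    // the sanity check entirely.
    object MissingIndexSketch {
      def openSegmentIndex(logFile: File, offsetIndexFile: File): Unit =
        if (!offsetIndexFile.exists())
          rebuildIndex(logFile, offsetIndexFile) // rare path, paid only at startup
        // common path: load the index lazily, no check at startup

      def rebuildIndex(logFile: File, offsetIndexFile: File): Unit =
        // placeholder for scanning the log segment and rewriting the index entries
        println(s"rebuilding ${offsetIndexFile.getName} from ${logFile.getName}")
    }
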
On Wed, Jun 27, 2018 at 12:04 AM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jason,

Thanks for the comment!

Your comment reminded me to read through Jay's comments and my reply again. It seems that I probably had not captured the idea in Jay's comment that the sanity check is not part of any formal guarantee we provide. I probably should have thought about this comment more. Let me reply to both your comment and Jay's and see if I can understand you better.

Here are some clarifications:
- The KIP does not intend to optimize recovery. It aims to optimize the sanity check when there is a clean shutdown.
- The sanity check only reads the last entry of the index, not the full index.
- We have already done a data-driven investigation, though not using hprof or strace. The resulting rolling bounce time is acceptable now. If it appears to be an issue again, e.g. with more data, then we may need to revisit this with a more detailed data-driven investigation.

I agree with the following comments:
- We should optimize the default behavior instead of adding a new config.
- The sanity check of segments before the recovery offset is not part of any formal guarantee, and thus we probably can just skip it.

So we are all leaning towards skipping the sanity check of all segments before the recovery offset. This solution would be pretty straightforward to understand and implement, and I am sure it will give us all the benefits that this KIP intends to achieve. Here is only one question to double check:

If a consumer fetches from an inactive segment, the broker will just use the index of that inactive segment. If anything goes wrong, e.g. the index file is corrupted or does not exist, then the broker will just treat it as an IOException, mark the disk and the partitions on that disk offline, and respond with KafkaStorageException to the consumer. Does this sound OK? One alternative solution is to let the broker rebuild the index. But this alternative is inconsistent with the idea that the "sanity check is not part of any formal guarantee", and it may tie up all the request handler threads rebuilding indexes.

If this solution sounds right, I will update the KIP accordingly.

Thanks,
Dong
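(A hedged sketch of the fetch-path behavior Dong describes just above; markLogDirOffline and the error string are stand-ins for the real broker plumbing.)

    import java.io.IOException

    // Sketch: an index problem on an inactive segment is treated like any other
    // storage error, rather than triggering an inline index rebuild on the
    // request handler thread.
    object StorageErrorSketch {
      def readFromInactiveSegment(read: () => Array[Byte], logDir: String): Either[String, Array[Byte]] =
        try Right(read())
        catch {
          case e: IOException =>
            markLogDirOffline(logDir, e)   // take the log dir and its partitions offline
            Left("KAFKA_STORAGE_ERROR")    // the consumer sees a KafkaStorageException
        }

      def markLogDirOffline(logDir: String, cause: Throwable): Unit =
        println(s"marking $logDir offline: ${cause.getMessage}")
    }
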
On Tue, Jun 26, 2018 at 3:23 PM, Jason Gustafson <ja...@confluent.io> wrote:

Hey Dong,

Sorry for being slow to catch up to this.

I think the benefit of the sanity check seems a little dubious in the first place. We detect garbage at the end of the index file, but that's about it. Is there any reason to think that corruption is more likely to occur there, or any other reason to think this check is still beneficial for flushed data? I assume we did the check because we presumed it was cheap, but perhaps the cost is adding up as the number of partitions grows. How much does startup time improve if we skip the sanity check for data earlier than the recovery point? Does the lazy loading itself give some additional benefit beyond skipping the sanity check? As Jay mentions above, the sanity checks seem strictly speaking optional. We don't bother checking the segments themselves, for example.

Thanks,
Jason

It probably still makes sense for segments beyond the recovery point

On Wed, Mar 21, 2018 at 9:59 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jay,

Yeah, our existing sanity check only reads the last entry in the index files. I must have miscommunicated if I previously said it was reading the full index. The broker appears to be spending a lot of time just reading the last entry of the index files for every log segment. This is probably because the OS loads a chunk of data that is much larger than the entry itself from disk into the page cache. This KIP tries to make this part of the operation lazy. I guess you are suggesting that we should just make the lazy loading the default behavior?

Yes, we currently require manual intervention if the log file is corrupted, i.e. if two messages with the same offset are appended to the disk (KAFKA-6488). The sanity check on broker startup is a bit different since it deals with corruption of index files (e.g. offset index, time index and snapshot files) instead of the log data. In this case, if index files are corrupted, the broker will automatically recover by rebuilding the index files using data in the log files, without requiring manual intervention. Thus the design question is whether this should be done before the broker can become leader for any partitions -- there is a tradeoff between broker startup time and the risk of delaying user requests if the broker needs to rebuild index files when it is already leader. I prefer lazy loading to reduce broker startup time. Not sure what the feedback from the community is on this issue.

Thanks,
Dong
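(An illustrative sketch of the lazy variant Dong describes above: the last-entry read the check performs today is deferred until the index is first used, so the page-cache fault happens on first access rather than during startup. LazyLastEntry is a hypothetical name.)

    import java.io.{File, RandomAccessFile}

    // Sketch: defer the single last-entry read until first access of the index.
    final class LazyLastEntry(file: File, entrySize: Int = 8) {
      lazy val lastEntry: Array[Byte] = {
        val raf = new RandomAccessFile(file, "r")
        try {
          val buf = new Array[Byte](entrySize)
          if (raf.length() >= entrySize) {
            raf.seek(raf.length() - entrySize) // read only the final entry
            raf.readFully(buf)
          }
          buf
        } finally raf.close()
      }
    }
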
On Wed, Mar 21, 2018 at 7:36 AM, Jay Kreps <j...@confluent.io> wrote:

Hey Dong,

Makes total sense. What I'm saying is that I don't think the sanity check is part of any formal guarantee we provide. It is true that corruption of data flushed to disk is a potential problem, but I don't think the sanity check solves that; it just has a couple of heuristics to help detect certain possible instances of it, right? In general I think our assumption has been that flushed data doesn't disappear or get corrupted, and if it does you need to manually intervene. I don't think people want to configure things at this level, so what I was suggesting was understanding why the sanity check is slow and trying to avoid that rather than making it configurable. I think you mentioned it was reading the full index into memory. Based on the performance you describe this could be true, but it definitely should not be reading anything but the last entry in the index, so that would be a bug. That read also happens in sanityCheck() only in the time-based index, right? In the offset index we do the same read, but it happens in initialization. If that read is the slow thing, it might make sense to try to remove it or make it lazy in both cases. If it is some other part of the code (e.g. the size check), then that may be able to be avoided entirely (I think by the time we sanity check we already know the file size from the mapping...). That was what I meant by doing some data-driven analysis. Maybe a quick run with hprof would help determine the root cause of why sanityCheck is slow?

-Jay
On Tue, Mar 20, 2018 at 12:13 AM Dong Lin <lindon...@gmail.com> wrote:

Hey Jay,

Thanks for your comments!

Yeah, recovery is different from the sanity check. They are correlated in the sense that there may still be corrupted index files even after a clean broker shutdown, and in that case, if we delay the sanity check then we may delay the log recovery. The main goal of this KIP is to optimize the sanity-check-related work so that it does not delay the broker startup much.

The KIP mentioned that the sanity check is done using the log recovery background threads. The name "recovery" is used mainly because the number of background threads is determined by the existing config num.recovery.threads.per.data.dir. I have updated the KIP to make this less confusing.

It makes a ton of sense to optimize the broker startup time in a data-driven fashion. The current optimization was done roughly in this fashion. The broker log shows that LogManager.loadLogs() takes a long time in large clusters. I then started a broker with a cold cache and repeatedly took thread dumps to see what the broker threads were doing during LogManager.loadLogs(). Most of the threads were working on sanityCheck(), and this motivates the change in this KIP. Previously, broker shutdown time was investigated in a similar data-driven fashion and optimized with KAFKA-6172 and KAFKA-6175. It seems that the current KIP can reduce the rolling bounce time of a large cluster by 50% -- there may be room for further improvement, but maybe those improvements do not require as big a change (with the caveat described in the KIP) as suggested in this KIP.

It is not clear whether it is safe to just read the latest segment without sanity checking all previous inactive segments of a given partition if transactions are used. Otherwise we probably want to always skip the sanity check of inactive segments without introducing a new config. Maybe the developers familiar with transactions can comment on that?

Thanks,
Dong
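(For context on the "recovery" thread naming Dong mentions above, a rough sketch of per-data-dir loading work sized by num.recovery.threads.per.data.dir; the pool wiring here is illustrative, not the actual LogManager code.)

    import java.util.concurrent.Executors

    // Sketch: one fixed pool per data dir, sized by num.recovery.threads.per.data.dir,
    // running the per-log loading tasks (where sanityCheck() showed up in thread dumps).
    object LoadLogsSketch {
      def loadLogs(dataDirs: Seq[String], numRecoveryThreadsPerDataDir: Int): Unit =
        dataDirs.foreach { dataDir =>
          val pool = Executors.newFixedThreadPool(numRecoveryThreadsPerDataDir)
          try {
            val tasks = listLogDirs(dataDir).map { logDir =>
              new Runnable { def run(): Unit = println(s"loading $logDir") }
            }
            tasks.map(t => pool.submit(t)).foreach(_.get()) // wait for all logs in this dir
          } finally pool.shutdown()
        }

      def listLogDirs(dataDir: String): Seq[String] =
        Option(new java.io.File(dataDir).listFiles()).toSeq.flatten
          .filter(_.isDirectory).map(_.getPath)
    }
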
On Mon, Mar 19, 2018 at 7:21 PM, Jay Kreps <j...@confluent.io> wrote:

Optimizing startup seems really valuable, but I'm a little confused by this.

There are two different things:
1. Recovery
2. Sanity check

The terminology we're using is a bit mixed here.

Recovery means checksumming the log segments and rebuilding the index on a hard crash. This only happens on unflushed segments, which is generally just the last segment. Recovery is essential for the correctness guarantees of the log and you shouldn't disable it. It only happens on hard crash and is not a factor in graceful restart. We can likely optimize it, but that would make most sense to do in a data-driven fashion off some profiling.

However, there is also a ton of disk activity that happens during initialization (lots of checks on the file size, absolute path, etc). I think these have crept in over time with people not really realizing this code is perf sensitive and Java hiding a lot of what is and isn't a file operation. One part of this is the sanityCheck() call for the two indexes. I don't think this call reads the full index, just the last entry in the index, right? There should be no need to read the full index except during recovery (and then only for the segments being recovered). I think it would make a ton of sense to optimize this, but I don't think that optimization needs to be configurable, as this is just a helpful sanity check to detect common nonsensical things in the index files; it isn't part of the core guarantees. In general you aren't supposed to lose committed data from disk, and if you do, we may be able to fail faster but we fundamentally can't really help you. Again, I think this would make the most sense to do in a data-driven way: if you look at that code, I think it is doing crazy amounts of file operations (e.g. getAbsolutePath, file sizes, etc). I think it'd make most sense to profile startup with a cold cache on a large log directory, and do the same with an strace, to see how many redundant system calls we do per segment and what is costing us, and then cut some of this out. I suspect we could speed up our startup time quite a lot if we did that.

For example we have a bunch of calls like this:

    require(len % entrySize == 0,
            "Index file " + file.getAbsolutePath + " is corrupt, found " + len +
            " bytes which is not positive or not a multiple of 8.")

I'm pretty sure file.getAbsolutePath is a system call, and I assume that happens whether or not you fail the in-memory check?

-Jay

On Sun, Feb 25, 2018 at 10:27 PM, Dong Lin <lindon...@gmail.com> wrote:

Hi all,

I have created KIP-263: Allow broker to skip sanity check of inactive segments on broker startup. See
https://cwiki.apache.org/confluence/display/KAFKA/KIP-263%3A+Allow+broker+to+skip+sanity+check+of+inactive+segments+on+broker+startup

This KIP provides a way to significantly reduce the time to rolling bounce a Kafka cluster.

Comments are welcome!

Thanks,
Dong