Hi Becket, +1
Just a few more minor comments that I don't think should result in any material change to the KIP: - Under "time index format": *A time index entry (timestamp, offset) means that any message whose timestamp is greater than timestamp come after offset. * - I think it may be clear enough from the context, but I think this is only true within a segment so consider making that explicit. Also, a minor nit-pick that would help reduce the burden on the reader: the text uses the same label for all three references - i.e., (*timestamp*, offset) and then two mentions of *timestamp* in the same sentence. Would be clearer if it is something like: *A time index entry (T, offset) means that any message whose timestamp is greater than T ...* - Under "build the time index" - *User should set the configuration accordingly* is vague. You mean in order to satisfy some user-determined size constraint? - *The time index is not globally monotonically increasing*: globally is a confusing word to use here. I think you basically mean to say does not necessarily monotonically increase within a partition. Instead, it is only monotonically increasing within each time index... - *The density of the index is defined by index.interval.bytes*: I don't quite see how this is directly defined by that configuration. i.e., it has a direct effect on the regular offset index density but only a secondary effect on the time index density right? - Under "Enforce time-based log retention": - I vaguely remember some mention of avoiding gaps - i.e., we enforce retention left to right. i.e., start removing from the first (by creation) log segment and stop when we reach an ineligible segment. Unless something changed may be worth explicitly calling that out. - There are a few more minor clarifying edits that I can help with offline. Thanks, Joel On Mon, Apr 18, 2016 at 7:35 PM, Becket Qin <becket....@gmail.com> wrote: > Hi Grant, > > Thanks for the review. Please see the replies inline. > > On Mon, Apr 18, 2016 at 1:56 PM, Grant Henke <ghe...@cloudera.com> wrote: > > > Thanks for the detailed write up Jiangjie. Overall the proposal looks > good > > to me. I have a few implementation detail questions that > > don't necessarily need to block progress: > > > > When searching by timestamp, broker will start from the earliest log > > > segment and check the last time index entry. If the timestamp of the > last > > > time index entry is greater than the target timestamp, the broker will > do > > > binary search on that time index to find the closest index entry and > scan > > > the log from there. Otherwise it will move on to the next log segment. > > > > > > > Does this mean having more old data will increase the length of timestamp > > searches? Have you considered searching from the most recent segment? I > am > > thinking looking up more recent data is likely the common case but > haven't > > thought about it too much. > > > > Segment level search is very quick. i.e. read the last time index entry > of > a segment and see if it is greater than the target timestamp. > Having more old data does mean we need to search more segments, but since > checking on a segment is cheap (12 bytes read) it is probably fine. > > > > To enforce time based log retention, the broker will check the last time > > > index entry of a log segment. The timestamp will be the latest > timestamp > > of > > > the messages in the log segment. So if that timestamp expires, the > broker > > > will delete the log segment. > > > > > > > Does this mean the choice between message.timestamp.type affects > retention > > time? > > The time index is timestamp type agnostic. No matter the timestamp type is > CreateTime or LogAppendTime, the same algorithm would work. > > > > > > When broker receives a message, if the message is not rejected due to > > > timestamp exceeds threshold, the message will be appended to the log. > > (The > > > timestamp will either be LogAppendTime or CreateTime depending on the > > > configuration) > > > > > > > What happens when message.timestamp.type is changed? > > > Because the time index is timestamp type agnostic. The message timestamp > change is transparent to the time index. So it seems there is no special > cases to be taken care of. > > > > > Are there any special considerations for compacted topics? > > I did not see anything special there. The only thing special might be > appending the last time index entry to the cleaned segment after cleaning. > > > > > Thank you, > > Grant > > > > > > On Mon, Apr 18, 2016 at 8:50 AM, Ismael Juma <ism...@juma.me.uk> wrote: > > > > > Hi Jiangjie, > > > > > > Thanks for the KIP, it's a nice improvement. Since it seems like we > have > > > been using the voting thread for discussion, I'll do the same. > > > > > > A few minor comments/questions: > > > > > > 1. The proposed name for the time index file > > `SegmentBaseOffset.timeindex`. > > > Would `SegmentBaseOffset.time-index` be a little better? It would > clearly > > > separate the type of index in case we add additional index types in the > > > future. > > > 2. When describing the time index entry, we say "Offset - the next > offset > > > when the time index entry is inserted". I found the mention of `next` a > > bit > > > confusing as it looks to me like the time index entry has the first > > offset > > > in the message set. > > > 3. We say "The default initial / max size of the time index files is > the > > > same as the offset index files. (time index entry is 1.5x of the size > of > > > offset index entry, user should set the configuration accordingly)". It > > may > > > be worth elaborating a little on what a user should do with regards to > > this > > > configuration when upgrading (ie maybe under "Compatibility, > Deprecation, > > > and Migration Plan"). > > > 4. In a previous vote thread, Jun said "The simplest thing is probably > > > to change > > > the default index size to 2MB to match the default log segment size" > and > > > you seemed to agree. I couldn't find anything about this in the KIP. > Are > > we > > > still doing it? > > > 5. We say "Instead, it is only monotonically increasing within each > time > > > index file. i.e. It is possible that the time index file for a later > log > > > segment contains smaller timestamp than some timestamp in the time > index > > > file of an earlier segment.". I think it would be good to explain under > > > which scenario a time index file for a later log segment contains a > > smaller > > > timestamp (is this only when CreateTime is used?). > > > 6. We say "When searching by timestamp, broker will start from the > > earliest > > > log segment and check the last time index entry.". The existing logic > > > searches from newest segment backwards. Is there a reason why we are > > > changing it? > > > 7. Do you mind if I fix typos and minor grammar issues directly in the > > > wiki? It seems easier than doing that via email. > > > > > > Thanks, > > > Ismael > > > > > > On Thu, Apr 7, 2016 at 1:44 AM, Becket Qin <becket....@gmail.com> > wrote: > > > > > > > Hi all, > > > > > > > > I updated KIP-33 based on the initial implementation. Per discussion > on > > > > yesterday's KIP hangout, I would like to initiate the new vote thread > > for > > > > KIP-33. > > > > > > > > The KIP wiki: > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index > > > > > > > > Here is a brief summary of the KIP: > > > > 1. We propose to add a time index for each log segment. > > > > 2. The time indices are going to be used of log retention, log > rolling > > > and > > > > message search by timestamp. > > > > > > > > There was an old voting thread which has some discussions on this > KIP. > > > The > > > > mail thread link is following: > > > > > > > > > > > > > > http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3ccabtagwgoebukyapfpchmycjk2tepq3ngtuwnhtr2tjvsnc8...@mail.gmail.com%3E > > > > > > > > I have the following WIP patch for reference. It needs a few more > unit > > > > tests and documentation. Other than that it should run fine. > > > > > > > > > > > > > > https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7 > > > > > > > > Thanks, > > > > > > > > Jiangjie (Becket) Qin > > > > > > > > > > > > > > > -- > > Grant Henke > > Software Engineer | Cloudera > > gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke > > >