One motivation of my proposal is actually to avoid having clients read
the timestamp type from the topic metadata response and behave
differently based on it, since:

1) The topic metadata response is not always in sync with the source of
truth (ZK), so by the time a client realizes the config has changed it may
already be too late (i.e. for a consumer, records with the wrong timestamp
could already have been returned to the user).

2) The client logic can be a bit simpler, and this will benefit non-Java
client development a lot. Also, we can avoid adding this field to the topic
metadata response.

Guozhang

On Tue, Jan 26, 2016 at 3:20 PM, Becket Qin <becket....@gmail.com> wrote:

> My hesitation about the changed protocol is that, if we will have topic
> configuration returned in the topic metadata, the current protocol makes
> more sense: because the timestamp type is a topic-level setting, we don't
> need to put it into each message. That assumes the timestamp type of a
> topic rarely changes, and that if it ever does, the existing data should
> be wiped out.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Tue, Jan 26, 2016 at 2:07 PM, Becket Qin <becket....@gmail.com> wrote:
>
> > Bump up this thread per discussion on the KIP hangout.
> >
> > During the implementation of the KIP, Guozhang raised another proposal on
> > how to indicate the timestamp type used by messages, so we want to see
> > people's opinions on this proposal.
> >
> > The current and the new proposal only differ for messages that are a)
> > compressed, and b) using LogAppendTime.
> >
> > For compressed messages using LogAppendTime, the timestamps in the
> > current proposal are handled as below:
> > 1. When a producer produces the messages, it tries to set the timestamp
> > to -1 for inner messages if it knows LogAppendTime is used.
> > 2. When a broker receives the messages, it overwrites the timestamp of
> > the inner messages to -1 if needed and writes the server time to the
> > wrapper message. The broker will re-compress if an inner message
> > timestamp is overwritten.
> > 3. When a consumer receives the messages, it sees that the inner message
> > timestamp is -1, so the wrapper message timestamp is used.
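> >
> > (As a rough sketch, step 3 on the consumer side boils down to the check
> > below; the class and constant names are illustrative, not actual client
> > code.)
> >
> >     public class InnerTimestampResolution {
> >         // Sentinel used in the current proposal for "no inner timestamp".
> >         private static final long NO_TIMESTAMP = -1L;
> >
> >         /**
> >          * Resolve the timestamp of a decompressed inner message: if the
> >          * producer (or broker) set it to -1, fall back to the wrapper
> >          * message's timestamp, which carries the broker's LogAppendTime.
> >          */
> >         public static long resolve(long innerTimestamp, long wrapperTimestamp) {
> >             return innerTimestamp == NO_TIMESTAMP ? wrapperTimestamp : innerTimestamp;
> >         }
> >     }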
> >
> > Implementation-wise, this proposal requires the producer to set the
> > timestamp of inner messages correctly to avoid broker-side re-compression.
> > To do that, the short-term solution is to let the producer infer the
> > timestamp type returned by the broker in the ProduceResponse and set the
> > correct timestamp afterwards. This means the first few batches will still
> > need re-compression on the broker. The long-term solution is to have the
> > producer get the topic configuration during metadata updates.
> >
> >
> > The proposed modification is:
> > 1. When a producer produces the messages, it always uses create time.
> > 2. When a broker receives the messages, it ignores the inner messages'
> > timestamps, simply sets a timestamp type attribute bit on the wrapper
> > message to 1, and sets the timestamp of the wrapper message to the server
> > time. (The broker will also set the timestamp type attribute bit
> > accordingly for non-compressed messages using LogAppendTime.)
> > 3. When a consumer receives the messages, it checks the timestamp type
> > attribute bit of the wrapper message. If it is set to 1, the inner
> > messages' timestamps will be ignored and the wrapper message's timestamp
> > will be used.
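> >
> > (A minimal sketch of the consumer-side check under this modification; the
> > bit position and names are assumptions for illustration only.)
> >
> >     public class WrapperTimestampType {
> >         // Illustrative bit position for the proposed timestamp type attribute.
> >         private static final int TIMESTAMP_TYPE_MASK = 0x08;
> >
> >         /** True if the wrapper message was stamped by the broker (LogAppendTime). */
> >         public static boolean isLogAppendTime(byte attributes) {
> >             return (attributes & TIMESTAMP_TYPE_MASK) != 0;
> >         }
> >
> >         /** Ignore the inner timestamp whenever the attribute bit is set. */
> >         public static long resolve(byte wrapperAttributes, long wrapperTimestamp,
> >                                     long innerTimestamp) {
> >             return isLogAppendTime(wrapperAttributes) ? wrapperTimestamp : innerTimestamp;
> >         }
> >     }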
> >
> > This approach uses an extra attribute bit. The upside of the modified
> > protocol is that consumers will be able to know the timestamp type, and
> > broker-side re-compression is completely avoided no matter what value is
> > sent by the producer. The downside is that in this approach the inner
> > messages will have wrong timestamps.
> >
> > We want to see if people have concerns over the modified approach.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> >
> >
> > On Tue, Dec 15, 2015 at 11:45 AM, Becket Qin <becket....@gmail.com>
> wrote:
> >
> >> Jun,
> >>
> >> 1. I agree it would be nice to have the timestamps used in a unified
> >> way. My concern is that if we let the server change the timestamp of the
> >> inner messages for LogAppendTime, that will force users of LogAppendTime
> >> to always pay the recompression penalty, so using LogAppendTime would
> >> make KIP-31 pointless.
> >>
> >> 4. If there are no entries in the log segment, we can read from the time
> >> index of the previous log segment. If there is still no entry available
> >> after we search back to the earliest log segment, that means all the
> >> previous log segments with a valid time index entry have been deleted. In
> >> that case there should presumably be only one log segment left - the
> >> active log segment - and we can simply set the latest timestamp to 0.
> >>
> >> Guozhang,
> >>
> >> Sorry for the confusion. By "the timestamp of the latest message" I
> >> actually meant "the timestamp of the message with the largest timestamp".
> >> So in your example the "latest message" is 5.
> >>
> >> Thanks,
> >>
> >> Jiangjie (Becket) Qin
> >>
> >>
> >>
> >> On Tue, Dec 15, 2015 at 10:02 AM, Guozhang Wang <wangg...@gmail.com>
> >> wrote:
> >>
> >>> Jun, Jiangjie,
> >>>
> >>> I am confused about 3) here: if we use "the timestamp of the latest
> >>> message", doesn't this mean we will also roll the log whenever a message
> >>> delayed by the rolling time is received? Just to clarify, my
> >>> understanding of "the timestamp of the latest message", for example in
> >>> the following log, is 1, not 5:
> >>>
> >>> 2, 3, 4, 5, 1
> >>>
> >>> Guozhang
> >>>
> >>>
> >>> On Mon, Dec 14, 2015 at 10:05 PM, Jun Rao <j...@confluent.io> wrote:
> >>>
> >>> > 1. Hmm, it's more intuitive if the consumer sees the same timestamp
> >>> > whether the messages are compressed or not. When
> >>> > message.timestamp.type=LogAppendTime, we will need to set the
> >>> > timestamp in each message if messages are not compressed, so that the
> >>> > follower can get the same timestamp. So, it seems that we should do
> >>> > the same thing for the inner messages when messages are compressed.
> >>> >
> >>> > 4. I thought that on startup we restore the timestamp of the latest
> >>> > message by reading from the time index of the last log segment. So,
> >>> > what happens if there are no index entries?
> >>> >
> >>> > Thanks,
> >>> >
> >>> > Jun
> >>> >
> >>> > On Mon, Dec 14, 2015 at 6:28 PM, Becket Qin <becket....@gmail.com>
> >>> wrote:
> >>> >
> >>> > > Thanks for the explanation, Jun.
> >>> > >
> >>> > > 1. That makes sense. So maybe we can do the following:
> >>> > > (a) Set the timestamp in the compressed message to the latest
> >>> > > timestamp of all its inner messages. This works for both
> >>> > > LogAppendTime and CreateTime.
> >>> > > (b) If message.timestamp.type=LogAppendTime, the broker will
> >>> > > overwrite all the inner message timestamps to -1 if they are not
> >>> > > already -1. This is mainly for topics that are using LogAppendTime.
> >>> > > Hopefully the producer will set the timestamp to -1 in the
> >>> > > ProducerRecord to avoid server-side recompression.
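> >>> > >
> >>> > > (A rough sketch of (b) on the broker side, with illustrative names; the
> >>> > > point is that recompression is only needed when some inner timestamp is
> >>> > > not already -1.)
> >>> > >
> >>> > >     public class InnerTimestampOverwrite {
> >>> > >         private static final long NO_TIMESTAMP = -1L;
> >>> > >
> >>> > >         /** True if any inner timestamp must be rewritten to -1,
> >>> > >          *  which forces the broker to recompress the message set. */
> >>> > >         public static boolean needsRecompression(long[] innerTimestamps) {
> >>> > >             for (long ts : innerTimestamps) {
> >>> > >                 if (ts != NO_TIMESTAMP) {
> >>> > >                     return true;
> >>> > >                 }
> >>> > >             }
> >>> > >             return false;
> >>> > >         }
> >>> > >     }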
> >>> > >
> >>> > > 3. I see. That works. So the semantics of log rolling become "roll
> >>> > > the log segment if it has been inactive since the latest message
> >>> > > arrived."
> >>> > >
> >>> > > 4. Yes. If the largest timestamp is in the previous log segment, the
> >>> > > time index for the current log segment does not have a valid offset
> >>> > > in the current log segment to point to. Maybe in that case we should
> >>> > > build an empty time index.
> >>> > >
> >>> > > Thanks,
> >>> > >
> >>> > > Jiangjie (Becket) Qin
> >>> > >
> >>> > >
> >>> > >
> >>> > > On Mon, Dec 14, 2015 at 5:51 PM, Jun Rao <j...@confluent.io> wrote:
> >>> > >
> >>> > > > 1. I was thinking more about saving the decompression overhead in
> >>> > > > the follower. Currently, the follower doesn't decompress the
> >>> > > > messages. To keep it that way, the outer message needs to include
> >>> > > > the timestamp of the latest inner message so the follower can
> >>> > > > build the time index. The simplest thing to do is to change the
> >>> > > > timestamp in the inner messages if necessary, in which case there
> >>> > > > will be the recompression overhead. However, in the case where the
> >>> > > > timestamps of the inner messages don't have to be changed
> >>> > > > (hopefully more common), there won't be any recompression
> >>> > > > overhead. In either case, the leader always sets the timestamp in
> >>> > > > the outer message to the timestamp of the latest inner message.
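> >>> > > >
> >>> > > > (As a sketch, the leader-side rule above is just a max over the inner
> >>> > > > messages; names are illustrative.)
> >>> > > >
> >>> > > >     public class OuterTimestamp {
> >>> > > >         /** Outer message timestamp = largest timestamp among the inner messages. */
> >>> > > >         public static long of(long[] innerTimestamps) {
> >>> > > >             long max = -1L; // -1 if the batch is empty or timestamps are unset
> >>> > > >             for (long ts : innerTimestamps) {
> >>> > > >                 max = Math.max(max, ts);
> >>> > > >             }
> >>> > > >             return max;
> >>> > > >         }
> >>> > > >     }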
> >>> > > >
> >>> > > > 3. Basically, in each log segment, we keep track of the timestamp
> >>> > > > of the latest message. If current time - timestamp of latest
> >>> > > > message > log rolling interval, we roll a new log segment. So, if
> >>> > > > messages with later timestamps keep getting added, we only roll
> >>> > > > new log segments based on size. On the other hand, if no new
> >>> > > > messages are added to a log, we can force a log roll based on
> >>> > > > time, which addresses the issue in (b).
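> >>> > > >
> >>> > > > (That check is roughly the following; names are illustrative, not the
> >>> > > > actual broker code.)
> >>> > > >
> >>> > > >     public class TimeBasedRolling {
> >>> > > >         /** Roll only if the segment has been inactive, i.e. no message
> >>> > > >          *  with a newer timestamp arrived within the rolling interval. */
> >>> > > >         public static boolean shouldRoll(long nowMs,
> >>> > > >                                          long latestMessageTimestampMs,
> >>> > > >                                          long rollIntervalMs) {
> >>> > > >             return nowMs - latestMessageTimestampMs > rollIntervalMs;
> >>> > > >         }
> >>> > > >     }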
> >>> > > >
> >>> > > > 4. Hmm, the index is per segment and should only point to positions
> >>> > > > in the corresponding .log file, not previous ones, right?
> >>> > > >
> >>> > > > Thanks,
> >>> > > >
> >>> > > > Jun
> >>> > > >
> >>> > > >
> >>> > > >
> >>> > > > On Mon, Dec 14, 2015 at 3:10 PM, Becket Qin <
> becket....@gmail.com>
> >>> > > wrote:
> >>> > > >
> >>> > > > > Hi Jun,
> >>> > > > >
> >>> > > > > Thanks a lot for the comments. Please see inline replies.
> >>> > > > >
> >>> > > > > Thanks,
> >>> > > > >
> >>> > > > > Jiangjie (Becket) Qin
> >>> > > > >
> >>> > > > > On Mon, Dec 14, 2015 at 10:19 AM, Jun Rao <j...@confluent.io>
> >>> wrote:
> >>> > > > >
> >>> > > > > > Hi, Becket,
> >>> > > > > >
> >>> > > > > > Thanks for the proposal. Looks good overall. A few comments
> >>> below.
> >>> > > > > >
> >>> > > > > > 1. KIP-32 didn't say what timestamp should be set in a
> >>> compressed
> >>> > > > > message.
> >>> > > > > > We probably should set it to the timestamp of the latest
> >>> messages
> >>> > > > > included
> >>> > > > > > in the compressed one. This way, during indexing, we don't
> >>> have to
> >>> > > > > > decompress the message.
> >>> > > > > >
> >>> > > > > That is a good point.
> >>> > > > > In normal cases, the broker needs to decompress the messages for
> >>> > > > > verification purposes anyway, so building the time index does not
> >>> > > > > add additional decompression.
> >>> > > > > During time index recovery, however, having a timestamp in the
> >>> > > > > compressed message might save the decompression.
> >>> > > > >
> >>> > > > > Another thing I am thinking is that we should make sure KIP-32
> >>> > > > > works well with KIP-31, i.e. we don't want to do recompression
> >>> > > > > just to add timestamps to messages.
> >>> > > > > With the approach in my last email, the timestamps in the messages
> >>> > > > > will either all be overwritten by the server if
> >>> > > > > message.timestamp.type=LogAppendTime, or not be overwritten at all
> >>> > > > > if message.timestamp.type=CreateTime.
> >>> > > > >
> >>> > > > > Maybe we can use the timestamp in compressed messages in the
> >>> > > > > following way:
> >>> > > > > If message.timestamp.type=LogAppendTime, we have to overwrite the
> >>> > > > > timestamps for all the messages. We can simply write the timestamp
> >>> > > > > in the compressed message to avoid recompression.
> >>> > > > > If message.timestamp.type=CreateTime, we do not need to overwrite
> >>> > > > > the timestamps. We either reject the entire compressed message or
> >>> > > > > we just leave the compressed message timestamp as -1.
> >>> > > > >
> >>> > > > > So the semantics of the timestamp field in the compressed message
> >>> > > > > become: if it is greater than 0, LogAppendTime is used and the
> >>> > > > > timestamp of the inner messages is the compressed message's
> >>> > > > > LogAppendTime; if it is -1, CreateTime is used and the timestamp
> >>> > > > > is in each individual inner message.
> >>> > > > >
> >>> > > > > This sacrifices the speed of recovery but seems worthwhile because
> >>> > > > > we avoid recompression.
> >>> > > > >
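> >>> > > > > (A minimal sketch of that rule on the broker; names are illustrative,
> >>> > > > > not actual code.)
> >>> > > > >
> >>> > > > >     public class WrapperTimestampAssignment {
> >>> > > > >         private static final long NO_TIMESTAMP = -1L;
> >>> > > > >
> >>> > > > >         /** Timestamp written into the compressed (wrapper) message:
> >>> > > > >          *  server time for LogAppendTime, otherwise -1 so consumers
> >>> > > > >          *  read each inner message's CreateTime. */
> >>> > > > >         public static long wrapperTimestamp(boolean logAppendTime,
> >>> > > > >                                             long serverTimeMs) {
> >>> > > > >             return logAppendTime ? serverTimeMs : NO_TIMESTAMP;
> >>> > > > >         }
> >>> > > > >     }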
> >>> > > > >
> >>> > > > > > 2. In KIP-33, should we make the time-based index interval
> >>> > > > > > configurable? Perhaps we can default it to 60 secs, but allow
> >>> > > > > > users to configure it to smaller values if they want more
> >>> > > > > > precision.
> >>> > > > > >
> >>> > > > > Yes, we can do that.
> >>> > > > >
> >>> > > > >
> >>> > > > > > 3. In KIP-33, I am not sure if log rolling should be based on
> >>> the
> >>> > > > > earliest
> >>> > > > > > message. This would mean that we will need to roll a log
> >>> segment
> >>> > > every
> >>> > > > > time
> >>> > > > > > we get a message delayed by the log rolling time interval.
> >>> Also, on
> >>> > > > > broker
> >>> > > > > > startup, we can get the timestamp of the latest message in a
> >>> log
> >>> > > > segment
> >>> > > > > > pretty efficiently by just looking at the last time index
> >>> entry.
> >>> > But
> >>> > > > > > getting the timestamp of the earliest timestamp requires a
> full
> >>> > scan
> >>> > > of
> >>> > > > > all
> >>> > > > > > log segments, which can be expensive. Previously, there were
> >>> two
> >>> > use
> >>> > > > > cases
> >>> > > > > > of time-based rolling: (a) more accurate time-based indexing
> >>> and
> >>> > (b)
> >>> > > > > > retaining data by time (since the active segment is never
> >>> deleted).
> >>> > > (a)
> >>> > > > > is
> >>> > > > > > already solved with a time-based index. For (b), if the
> >>> retention
> >>> > is
> >>> > > > > based
> >>> > > > > > on the timestamp of the latest message in a log segment,
> >>> perhaps
> >>> > log
> >>> > > > > > rolling should be based on that too.
> >>> > > > > >
> >>> > > > > I am not sure how to make log rolling work with the latest
> >>> > > > > timestamp in the current log segment. Do you mean the log rolling
> >>> > > > > can be based on the last log segment's latest timestamp? If so,
> >>> > > > > how do we roll the first segment?
> >>> > > > >
> >>> > > > >
> >>> > > > > > 4. In KIP-33, I presume the timestamp in the time index will
> be
> >>> > > > > > monotonically increasing. So, if all messages in a log
> segment
> >>> > have a
> >>> > > > > > timestamp less than the largest timestamp in the previous log
> >>> > > segment,
> >>> > > > we
> >>> > > > > > will use the latter to index this log segment?
> >>> > > > > >
> >>> > > > > Yes. The timestamps are monotonically increasing. If the largest
> >>> > > > > timestamp in the previous segment is very big, it is possible that
> >>> > > > > the time index of the current segment only has two index entries
> >>> > > > > (inserted at segment creation and at roll), both pointing to a
> >>> > > > > message in the previous log segment. This is the corner case I
> >>> > > > > mentioned before, where we would expire the next log segment even
> >>> > > > > before expiring the previous log segment just because the largest
> >>> > > > > timestamp is in the previous log segment. In the current approach,
> >>> > > > > we will wait until the previous log segment expires, and then
> >>> > > > > delete both the previous log segment and the next log segment.
> >>> > > > >
> >>> > > > >
> >>> > > > > > 5. In KIP-32, in the wire protocol, we mention both timestamp
> >>> and
> >>> > > time.
> >>> > > > > > They should be consistent.
> >>> > > > > >
> >>> > > > > Will fix the wiki page.
> >>> > > > >
> >>> > > > >
> >>> > > > > > Jun
> >>> > > > > >
> >>> > > > > >
> >>> > > > > >
> >>> > > > > > On Thu, Dec 10, 2015 at 10:13 AM, Becket Qin <
> >>> becket....@gmail.com
> >>> > >
> >>> > > > > wrote:
> >>> > > > > >
> >>> > > > > > > Hey Jay,
> >>> > > > > > >
> >>> > > > > > > Thanks for the comments.
> >>> > > > > > >
> >>> > > > > > > Good point about the actions to take when
> >>> > > > > > > max.message.time.difference is exceeded. Rejection is a useful
> >>> > > > > > > behavior, although I cannot think of a use case at LinkedIn at
> >>> > > > > > > this moment. I think it makes sense to add a configuration.
> >>> > > > > > >
> >>> > > > > > > How about the following configurations?
> >>> > > > > > > 1. message.timestamp.type=CreateTime/LogAppendTime
> >>> > > > > > > 2. max.message.time.difference.ms
> >>> > > > > > >
> >>> > > > > > > If message.timestamp.type is set to CreateTime, when the broker
> >>> > > > > > > receives a message it will further check
> >>> > > > > > > max.message.time.difference.ms and will reject the message if
> >>> > > > > > > the time difference exceeds the threshold.
> >>> > > > > > > If message.timestamp.type is set to LogAppendTime, the broker
> >>> > > > > > > will always stamp the message with the current server time,
> >>> > > > > > > regardless of the value of max.message.time.difference.ms.
> >>> > > > > > >
> >>> > > > > > > This will make sure the messages on the broker use either
> >>> > > > > > > CreateTime or LogAppendTime, but not a mixture of both.
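> >>> > > > > > >
> >>> > > > > > > (As a rough sketch of that broker-side behavior; class and method
> >>> > > > > > > names are made up for illustration.)
> >>> > > > > > >
> >>> > > > > > >     public class BrokerTimestampPolicy {
> >>> > > > > > >         public enum TimestampType { CREATE_TIME, LOG_APPEND_TIME }
> >>> > > > > > >
> >>> > > > > > >         /** Returns the timestamp to append, or throws if the message is rejected. */
> >>> > > > > > >         public static long validate(TimestampType type, long messageTimestamp,
> >>> > > > > > >                                     long serverTimeMs, long maxDifferenceMs) {
> >>> > > > > > >             if (type == TimestampType.LOG_APPEND_TIME) {
> >>> > > > > > >                 return serverTimeMs; // always stamp with broker time
> >>> > > > > > >             }
> >>> > > > > > >             if (Math.abs(serverTimeMs - messageTimestamp) > maxDifferenceMs) {
> >>> > > > > > >                 throw new IllegalArgumentException("Timestamp too far from broker time");
> >>> > > > > > >             }
> >>> > > > > > >             return messageTimestamp; // keep the producer's CreateTime
> >>> > > > > > >         }
> >>> > > > > > >     }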
> >>> > > > > > >
> >>> > > > > > > What do you think?
> >>> > > > > > >
> >>> > > > > > > Thanks,
> >>> > > > > > >
> >>> > > > > > > Jiangjie (Becket) Qin
> >>> > > > > > >
> >>> > > > > > >
> >>> > > > > > > On Wed, Dec 9, 2015 at 2:42 PM, Jay Kreps <
> j...@confluent.io>
> >>> > > wrote:
> >>> > > > > > >
> >>> > > > > > > > Hey Becket,
> >>> > > > > > > >
> >>> > > > > > > > That summary of pros and cons sounds about right to me.
> >>> > > > > > > >
> >>> > > > > > > > There are potentially two actions you could take when
> >>> > > > > > > > max.message.time.difference is exceeded--override it or
> >>> reject
> >>> > > the
> >>> > > > > > > > message entirely. Can we pick one of these or does the
> >>> action
> >>> > > need
> >>> > > > to
> >>> > > > > > > > be configurable too? (I'm not sure). The downside of more
> >>> > > > > > > > configuration is that it is more fiddly and has more
> modes.
> >>> > > > > > > >
> >>> > > > > > > > I suppose the reason I was thinking of this as a
> >>> "difference"
> >>> > > > rather
> >>> > > > > > > > than a hard type was that if you were going to go the
> >>> reject
> >>> > mode
> >>> > > > you
> >>> > > > > > > > would need some tolerance setting (i.e. if your SLA is
> >>> that if
> >>> > > your
> >>> > > > > > > > timestamp is off by more than 10 minutes I give you an
> >>> error).
> >>> > I
> >>> > > > > agree
> >>> > > > > > > > with you that having one field that is potentially
> >>> containing a
> >>> > > mix
> >>> > > > > of
> >>> > > > > > > > two values is a bit weird.
> >>> > > > > > > >
> >>> > > > > > > > -Jay
> >>> > > > > > > >
> >>> > > > > > > > On Mon, Dec 7, 2015 at 5:17 PM, Becket Qin <
> >>> > becket....@gmail.com
> >>> > > >
> >>> > > > > > wrote:
> >>> > > > > > > > > It looks the format of the previous email was messed
> up.
> >>> Send
> >>> > > it
> >>> > > > > > again.
> >>> > > > > > > > >
> >>> > > > > > > > > Just to recap, the last proposal Jay made (with some
> >>> > > > implementation
> >>> > > > > > > > > details added)
> >>> > > > > > > > > was:
> >>> > > > > > > > >
> >>> > > > > > > > > 1. Allow user to stamp the message when produce
> >>> > > > > > > > >
> >>> > > > > > > > > 2. When broker receives a message it take a look at the
> >>> > > > difference
> >>> > > > > > > > between
> >>> > > > > > > > > its local time and the timestamp in the message.
> >>> > > > > > > > >   a. If the time difference is within a configurable
> >>> > > > > > > > > max.message.time.difference.ms, the server will accept
> >>> it
> >>> > and
> >>> > > > > append
> >>> > > > > > > it
> >>> > > > > > > > to
> >>> > > > > > > > > the log.
> >>> > > > > > > > >   b. If the time difference is beyond the configured
> >>> > > > > > > > > max.message.time.difference.ms, the server will
> >>> override the
> >>> > > > > > timestamp
> >>> > > > > > > > with
> >>> > > > > > > > > its current local time and append the message to the
> log.
> >>> > > > > > > > >   c. The default value of max.message.time.difference
> >>> would
> >>> > be
> >>> > > > set
> >>> > > > > to
> >>> > > > > > > > > Long.MaxValue.
> >>> > > > > > > > >
> >>> > > > > > > > > 3. The configurable time difference threshold
> >>> > > > > > > > > max.message.time.difference.ms will
> >>> > > > > > > > > be a per topic configuration.
> >>> > > > > > > > >
> >>> > > > > > > > > 4. The index will be built so it has the following
> >>> > guarantee.
> >>> > > > > > > > >   a. If user search by time stamp:
> >>> > > > > > > > >       - all the messages after that timestamp will be
> >>> > consumed.
> >>> > > > > > > > >       - user might see earlier messages.
> >>> > > > > > > > >   b. The log retention will take a look at the last
> time
> >>> > index
> >>> > > > > entry
> >>> > > > > > in
> >>> > > > > > > > the
> >>> > > > > > > > > time index file. Because the last entry will be the
> >>> latest
> >>> > > > > timestamp
> >>> > > > > > in
> >>> > > > > > > > the
> >>> > > > > > > > > entire log segment. If that entry expires, the log
> >>> segment
> >>> > will
> >>> > > > be
> >>> > > > > > > > deleted.
> >>> > > > > > > > >   c. The log rolling has to depend on the earliest
> >>> timestamp.
> >>> > > In
> >>> > > > > this
> >>> > > > > > > > case
> >>> > > > > > > > > we may need to keep a in memory timestamp only for the
> >>> > current
> >>> > > > > active
> >>> > > > > > > > log.
> >>> > > > > > > > > On recover, we will need to read the active log segment
> >>> to
> >>> > get
> >>> > > > this
> >>> > > > > > > > timestamp
> >>> > > > > > > > > of the earliest messages.
> >>> > > > > > > > >
> >>> > > > > > > > > 5. The downside of this proposal are:
> >>> > > > > > > > >   a. The timestamp might not be monotonically
> increasing.
> >>> > > > > > > > >   b. The log retention might become non-deterministic.
> >>> i.e.
> >>> > > When
> >>> > > > a
> >>> > > > > > > > message
> >>> > > > > > > > > will be deleted now depends on the timestamp of the
> other
> >>> > > > messages
> >>> > > > > in
> >>> > > > > > > the
> >>> > > > > > > > > same log segment. And those timestamps are provided by
> >>> > > > > > > > > user within a range depending on what the time
> difference
> >>> > > > threshold
> >>> > > > > > > > > configuration is.
> >>> > > > > > > > >   c. The semantic meaning of the timestamp in the
> >>> messages
> >>> > > could
> >>> > > > > be a
> >>> > > > > > > > little
> >>> > > > > > > > > bit vague because some of them come from the producer
> and
> >>> > some
> >>> > > of
> >>> > > > > > them
> >>> > > > > > > > are
> >>> > > > > > > > > overwritten by brokers.
> >>> > > > > > > > >
> >>> > > > > > > > > 6. Although the proposal has some downsides, it gives
> >>> user
> >>> > the
> >>> > > > > > > > flexibility
> >>> > > > > > > > > to use the timestamp.
> >>> > > > > > > > >   a. If the threshold is set to Long.MaxValue. The
> >>> timestamp
> >>> > in
> >>> > > > the
> >>> > > > > > > > message is
> >>> > > > > > > > > equivalent to CreateTime.
> >>> > > > > > > > >   b. If the threshold is set to 0. The timestamp in the
> >>> > message
> >>> > > > is
> >>> > > > > > > > equivalent
> >>> > > > > > > > > to LogAppendTime.
> >>> > > > > > > > >
> >>> > > > > > > > > This proposal actually allows user to use either
> >>> CreateTime
> >>> > or
> >>> > > > > > > > LogAppendTime
> >>> > > > > > > > > without introducing two timestamp concept at the same
> >>> time. I
> >>> > > > have
> >>> > > > > > > > updated
> >>> > > > > > > > > the wiki for KIP-32 and KIP-33 with this proposal.
> >>> > > > > > > > >
> >>> > > > > > > > > One thing I am thinking is that instead of having a
> time
> >>> > > > difference
> >>> > > > > > > > threshold,
> >>> > > > > > > > > should we simply set have a TimestampType
> configuration?
> >>> > > Because
> >>> > > > in
> >>> > > > > > > most
> >>> > > > > > > > > cases, people will either set the threshold to 0 or
> >>> > > > Long.MaxValue.
> >>> > > > > > > > Setting
> >>> > > > > > > > > anything in between will make the timestamp in the
> >>> message
> >>> > > > > > meaningless
> >>> > > > > > > to
> >>> > > > > > > > > user - user don't know if the timestamp has been
> >>> overwritten
> >>> > by
> >>> > > > the
> >>> > > > > > > > brokers.
> >>> > > > > > > > >
> >>> > > > > > > > > Any thoughts?
> >>> > > > > > > > >
> >>> > > > > > > > > Thanks,
> >>> > > > > > > > > Jiangjie (Becket) Qin
> >>> > > > > > > > >
> >>> > > > > > > > > On Mon, Dec 7, 2015 at 10:33 AM, Jiangjie Qin
> >>> > > > > > > <j...@linkedin.com.invalid
> >>> > > > > > > > >
> >>> > > > > > > > > wrote:
> >>> > > > > > > > >
> >>> > > > > > > > >> Bump up this thread.
> >>> > > > > > > > >>
> >>> > > > > > > > >> Just to recap, the last proposal Jay made (with some
> >>> > > > > implementation
> >>> > > > > > > > details
> >>> > > > > > > > >> added) was:
> >>> > > > > > > > >>
> >>> > > > > > > > >>    1. Allow user to stamp the message when produce
> >>> > > > > > > > >>    2. When broker receives a message it take a look at
> >>> the
> >>> > > > > > difference
> >>> > > > > > > > >>    between its local time and the timestamp in the
> >>> message.
> >>> > > > > > > > >>       - If the time difference is within a
> configurable
> >>> > > > > > > > >>       max.message.time.difference.ms, the server will
> >>> > accept
> >>> > > it
> >>> > > > > and
> >>> > > > > > > > append
> >>> > > > > > > > >>       it to the log.
> >>> > > > > > > > >>       - If the time difference is beyond the
> configured
> >>> > > > > > > > >>       max.message.time.difference.ms, the server will
> >>> > > override
> >>> > > > > the
> >>> > > > > > > > >>       timestamp with its current local time and append
> >>> the
> >>> > > > message
> >>> > > > > > to
> >>> > > > > > > > the
> >>> > > > > > > > >> log.
> >>> > > > > > > > >>       - The default value of
> max.message.time.difference
> >>> > would
> >>> > > > be
> >>> > > > > > set
> >>> > > > > > > to
> >>> > > > > > > > >>       Long.MaxValue.
> >>> > > > > > > > >>       3. The configurable time difference threshold
> >>> > > > > > > > >>    max.message.time.difference.ms will be a per topic
> >>> > > > > > configuration.
> >>> > > > > > > > >>    4. The indexed will be built so it has the
> following
> >>> > > > guarantee.
> >>> > > > > > > > >>       - If user search by time stamp:
> >>> > > > > > > > >>    - all the messages after that timestamp will be
> >>> consumed.
> >>> > > > > > > > >>       - user might see earlier messages.
> >>> > > > > > > > >>       - The log retention will take a look at the last
> >>> time
> >>> > > > index
> >>> > > > > > > entry
> >>> > > > > > > > in
> >>> > > > > > > > >>       the time index file. Because the last entry will
> >>> be
> >>> > the
> >>> > > > > latest
> >>> > > > > > > > >> timestamp in
> >>> > > > > > > > >>       the entire log segment. If that entry expires,
> >>> the log
> >>> > > > > segment
> >>> > > > > > > > will
> >>> > > > > > > > >> be
> >>> > > > > > > > >>       deleted.
> >>> > > > > > > > >>       - The log rolling has to depend on the earliest
> >>> > > timestamp.
> >>> > > > > In
> >>> > > > > > > this
> >>> > > > > > > > >>       case we may need to keep a in memory timestamp
> >>> only
> >>> > for
> >>> > > > the
> >>> > > > > > > > >> current active
> >>> > > > > > > > >>       log. On recover, we will need to read the active
> >>> log
> >>> > > > segment
> >>> > > > > > to
> >>> > > > > > > > get
> >>> > > > > > > > >> this
> >>> > > > > > > > >>       timestamp of the earliest messages.
> >>> > > > > > > > >>    5. The downside of this proposal are:
> >>> > > > > > > > >>       - The timestamp might not be monotonically
> >>> increasing.
> >>> > > > > > > > >>       - The log retention might become
> >>> non-deterministic.
> >>> > i.e.
> >>> > > > > When
> >>> > > > > > a
> >>> > > > > > > > >>       message will be deleted now depends on the
> >>> timestamp
> >>> > of
> >>> > > > the
> >>> > > > > > > > >> other messages
> >>> > > > > > > > >>       in the same log segment. And those timestamps
> are
> >>> > > provided
> >>> > > > > by
> >>> > > > > > > > >> user within a
> >>> > > > > > > > >>       range depending on what the time difference
> >>> threshold
> >>> > > > > > > > configuration
> >>> > > > > > > > >> is.
> >>> > > > > > > > >>       - The semantic meaning of the timestamp in the
> >>> > messages
> >>> > > > > could
> >>> > > > > > > be a
> >>> > > > > > > > >>       little bit vague because some of them come from
> >>> the
> >>> > > > producer
> >>> > > > > > and
> >>> > > > > > > > >> some of
> >>> > > > > > > > >>       them are overwritten by brokers.
> >>> > > > > > > > >>       6. Although the proposal has some downsides, it
> >>> gives
> >>> > > user
> >>> > > > > the
> >>> > > > > > > > >>    flexibility to use the timestamp.
> >>> > > > > > > > >>    - If the threshold is set to Long.MaxValue. The
> >>> timestamp
> >>> > > in
> >>> > > > > the
> >>> > > > > > > > message
> >>> > > > > > > > >>       is equivalent to CreateTime.
> >>> > > > > > > > >>       - If the threshold is set to 0. The timestamp in
> >>> the
> >>> > > > message
> >>> > > > > > is
> >>> > > > > > > > >>       equivalent to LogAppendTime.
> >>> > > > > > > > >>
> >>> > > > > > > > >> This proposal actually allows user to use either
> >>> CreateTime
> >>> > or
> >>> > > > > > > > >> LogAppendTime without introducing two timestamp
> concept
> >>> at
> >>> > the
> >>> > > > > same
> >>> > > > > > > > time. I
> >>> > > > > > > > >> have updated the wiki for KIP-32 and KIP-33 with this
> >>> > > proposal.
> >>> > > > > > > > >>
> >>> > > > > > > > >> One thing I am thinking is that instead of having a
> time
> >>> > > > > difference
> >>> > > > > > > > >> threshold, should we simply set have a TimestampType
> >>> > > > > configuration?
> >>> > > > > > > > Because
> >>> > > > > > > > >> in most cases, people will either set the threshold to
> >>> 0 or
> >>> > > > > > > > Long.MaxValue.
> >>> > > > > > > > >> Setting anything in between will make the timestamp in
> >>> the
> >>> > > > message
> >>> > > > > > > > >> meaningless to user - user don't know if the timestamp
> >>> has
> >>> > > been
> >>> > > > > > > > overwritten
> >>> > > > > > > > >> by the brokers.
> >>> > > > > > > > >>
> >>> > > > > > > > >> Any thoughts?
> >>> > > > > > > > >>
> >>> > > > > > > > >> Thanks,
> >>> > > > > > > > >> Jiangjie (Becket) Qin
> >>> > > > > > > > >>
> >>> > > > > > > > >> On Mon, Oct 26, 2015 at 1:23 PM, Jiangjie Qin <
> >>> > > > j...@linkedin.com>
> >>> > > > > > > > wrote:
> >>> > > > > > > > >>
> >>> > > > > > > > >> > Hi Jay,
> >>> > > > > > > > >> >
> >>> > > > > > > > >> > Thanks for such detailed explanation. I think we
> both
> >>> are
> >>> > > > trying
> >>> > > > > > to
> >>> > > > > > > > make
> >>> > > > > > > > >> > CreateTime work for us if possible. To me by "work"
> it
> >>> > means
> >>> > > > > clear
> >>> > > > > > > > >> > guarantees on:
> >>> > > > > > > > >> > 1. Log Retention Time enforcement.
> >>> > > > > > > > >> > 2. Log Rolling time enforcement (This might be less
> a
> >>> > > concern
> >>> > > > as
> >>> > > > > > you
> >>> > > > > > > > >> > pointed out)
> >>> > > > > > > > >> > 3. Application search message by time.
> >>> > > > > > > > >> >
> >>> > > > > > > > >> > WRT (1), I agree the expectation for log retention
> >>> might
> >>> > be
> >>> > > > > > > different
> >>> > > > > > > > >> > depending on who we ask. But my concern is about the
> >>> level
> >>> > > of
> >>> > > > > > > > guarantee
> >>> > > > > > > > >> we
> >>> > > > > > > > >> > give to user. My observation is that a clear
> >>> guarantee to
> >>> > > user
> >>> > > > > is
> >>> > > > > > > > >> critical
> >>> > > > > > > > >> > regardless of the mechanism we choose. And this is
> the
> >>> > > subtle
> >>> > > > > but
> >>> > > > > > > > >> important
> >>> > > > > > > > >> > difference between using LogAppendTime and
> CreateTime.
> >>> > > > > > > > >> >
> >>> > > > > > > > >> > Let's say user asks this question: How long will my
> >>> > message
> >>> > > > stay
> >>> > > > > > in
> >>> > > > > > > > >> Kafka?
> >>> > > > > > > > >> >
> >>> > > > > > > > >> > If we use LogAppendTime for log retention, the
> answer
> >>> is
> >>> > > > message
> >>> > > > > > > will
> >>> > > > > > > > >> stay
> >>> > > > > > > > >> > in Kafka for retention time after the message is
> >>> produced
> >>> > > (to
> >>> > > > be
> >>> > > > > > > more
> >>> > > > > > > > >> > precise, upper bounded by log.rolling.ms +
> >>> > log.retention.ms
> >>> > > ).
> >>> > > > > > User
> >>> > > > > > > > has a
> >>> > > > > > > > >> > clear guarantee and they may decide whether or not
> to
> >>> put
> >>> > > the
> >>> > > > > > > message
> >>> > > > > > > > >> into
> >>> > > > > > > > >> > Kafka. Or how to adjust the retention time according
> >>> to
> >>> > > their
> >>> > > > > > > > >> requirements.
> >>> > > > > > > > >> > If we use create time for log retention, the answer
> >>> would
> >>> > be
> >>> > > > it
> >>> > > > > > > > depends.
> >>> > > > > > > > >> > The best answer we can give is at least
> retention.ms
> >>> > > because
> >>> > > > > > there
> >>> > > > > > > > is no
> >>> > > > > > > > >> > guarantee when the messages will be deleted after
> >>> that.
> >>> > If a
> >>> > > > > > message
> >>> > > > > > > > sits
> >>> > > > > > > > >> > somewhere behind a larger create time, the message
> >>> might
> >>> > > stay
> >>> > > > > > longer
> >>> > > > > > > > than
> >>> > > > > > > > >> > expected. But we don't know how longer it would be
> >>> because
> >>> > > it
> >>> > > > > > > depends
> >>> > > > > > > > on
> >>> > > > > > > > >> > the create time. In this case, it is hard for user
> to
> >>> > decide
> >>> > > > > what
> >>> > > > > > to
> >>> > > > > > > > do.
> >>> > > > > > > > >> >
> >>> > > > > > > > >> > I am worrying about this because a blurring
> guarantee
> >>> has
> >>> > > > bitten
> >>> > > > > > us
> >>> > > > > > > > >> > before, e.g. Topic creation. We have received many
> >>> > questions
> >>> > > > > like
> >>> > > > > > > > "why my
> >>> > > > > > > > >> > topic is not there after I created it". I can
> imagine
> >>> we
> >>> > > > receive
> >>> > > > > > > > similar
> >>> > > > > > > > >> > question asking "why my message is still there after
> >>> > > retention
> >>> > > > > > time
> >>> > > > > > > > has
> >>> > > > > > > > >> > reached". So my understanding is that a clear and
> >>> solid
> >>> > > > > guarantee
> >>> > > > > > is
> >>> > > > > > > > >> better
> >>> > > > > > > > >> > than having a mechanism that works in most cases but
> >>> > > > > occasionally
> >>> > > > > > > does
> >>> > > > > > > > >> not
> >>> > > > > > > > >> > work.
> >>> > > > > > > > >> >
> >>> > > > > > > > >> > If we think of the retention guarantee we provide
> with
> >>> > > > > > > LogAppendTime,
> >>> > > > > > > > it
> >>> > > > > > > > >> > is not broken as you said, because we are telling
> >>> user the
> >>> > > log
> >>> > > > > > > > retention
> >>> > > > > > > > >> is
> >>> > > > > > > > >> > NOT based on create time at the first place.
> >>> > > > > > > > >> >
> >>> > > > > > > > >> > WRT (3), no matter whether we index on LogAppendTime
> >>> or
> >>> > > > > > CreateTime,
> >>> > > > > > > > the
> >>> > > > > > > > >> > best guarantee we can provide with user is "not
> >>> missing
> >>> > > > message
> >>> > > > > > > after
> >>> > > > > > > > a
> >>> > > > > > > > >> > certain timestamp". Therefore I actually really like
> >>> to
> >>> > > index
> >>> > > > on
> >>> > > > > > > > >> CreateTime
> >>> > > > > > > > >> > because that is the timestamp we provide to user,
> and
> >>> we
> >>> > can
> >>> > > > > have
> >>> > > > > > > the
> >>> > > > > > > > >> solid
> >>> > > > > > > > >> > guarantee.
> >>> > > > > > > > >> > On the other hand, indexing on LogAppendTime and
> >>> giving
> >>> > user
> >>> > > > > > > > CreateTime
> >>> > > > > > > > >> > does not provide solid guarantee when user do search
> >>> based
> >>> > > on
> >>> > > > > > > > timestamp.
> >>> > > > > > > > >> It
> >>> > > > > > > > >> > only works when LogAppendTime is always no earlier
> >>> than
> >>> > > > > > CreateTime.
> >>> > > > > > > > This
> >>> > > > > > > > >> is
> >>> > > > > > > > >> > a reasonable assumption and we can easily enforce
> it.
> >>> > > > > > > > >> >
> >>> > > > > > > > >> > With above, I am not sure if we can avoid server
> >>> timestamp
> >>> > > to
> >>> > > > > make
> >>> > > > > > > log
> >>> > > > > > > > >> > retention work with a clear guarantee. For searching
> >>> by
> >>> > > > > timestamp
> >>> > > > > > > use
> >>> > > > > > > > >> case,
> >>> > > > > > > > >> > I really want to have the index built on CreateTime.
> >>> But
> >>> > > with
> >>> > > > a
> >>> > > > > > > > >> reasonable
> >>> > > > > > > > >> > assumption and timestamp enforcement, a
> LogAppendTime
> >>> > index
> >>> > > > > would
> >>> > > > > > > also
> >>> > > > > > > > >> work.
> >>> > > > > > > > >> >
> >>> > > > > > > > >> > Thanks,
> >>> > > > > > > > >> >
> >>> > > > > > > > >> > Jiangjie (Becket) Qin
> >>> > > > > > > > >> >
> >>> > > > > > > > >> >
> >>> > > > > > > > >> >
> >>> > > > > > > > >> > On Thu, Oct 22, 2015 at 10:48 AM, Jay Kreps <
> >>> > > j...@confluent.io
> >>> > > > >
> >>> > > > > > > wrote:
> >>> > > > > > > > >> >
> >>> > > > > > > > >> >> Hey Becket,
> >>> > > > > > > > >> >>
> >>> > > > > > > > >> >> Let me see if I can address your concerns:
> >>> > > > > > > > >> >>
> >>> > > > > > > > >> >> 1. Let's say we have two source clusters that are
> >>> > mirrored
> >>> > > to
> >>> > > > > the
> >>> > > > > > > > same
> >>> > > > > > > > >> >> > target cluster. For some reason one of the mirror
> >>> maker
> >>> > > > from
> >>> > > > > a
> >>> > > > > > > > cluster
> >>> > > > > > > > >> >> dies
> >>> > > > > > > > >> >> > and after fix the issue we want to resume
> >>> mirroring. In
> >>> > > > this
> >>> > > > > > case
> >>> > > > > > > > it
> >>> > > > > > > > >> is
> >>> > > > > > > > >> >> > possible that when the mirror maker resumes
> >>> mirroring,
> >>> > > the
> >>> > > > > > > > timestamp
> >>> > > > > > > > >> of
> >>> > > > > > > > >> >> the
> >>> > > > > > > > >> >> > messages have already gone beyond the acceptable
> >>> > > timestamp
> >>> > > > > > range
> >>> > > > > > > on
> >>> > > > > > > > >> >> broker.
> >>> > > > > > > > >> >> > In order to let those messages go through, we
> have
> >>> to
> >>> > > bump
> >>> > > > up
> >>> > > > > > the
> >>> > > > > > > > >> >> > *max.append.delay
> >>> > > > > > > > >> >> > *for all the topics on the target broker. This
> >>> could be
> >>> > > > > > painful.
> >>> > > > > > > > >> >>
> >>> > > > > > > > >> >>
> >>> > > > > > > > >> >> Actually what I was suggesting was different. Here
> >>> is my
> >>> > > > > > > observation:
> >>> > > > > > > > >> >> clusters/topics directly produced to by
> applications
> >>> > have a
> >>> > > > > valid
> >>> > > > > > > > >> >> assertion
> >>> > > > > > > > >> >> that log append time and create time are similar
> >>> (let's
> >>> > > call
> >>> > > > > > these
> >>> > > > > > > > >> >> "unbuffered"); other cluster/topic such as those
> that
> >>> > > receive
> >>> > > > > > data
> >>> > > > > > > > from
> >>> > > > > > > > >> a
> >>> > > > > > > > >> >> database, a log file, or another kafka cluster
> don't
> >>> have
> >>> > > > that
> >>> > > > > > > > >> assertion,
> >>> > > > > > > > >> >> for these "buffered" clusters data can be
> arbitrarily
> >>> > late.
> >>> > > > > This
> >>> > > > > > > > means
> >>> > > > > > > > >> any
> >>> > > > > > > > >> >> use of log append time on these buffered clusters
> is
> >>> not
> >>> > > very
> >>> > > > > > > > >> meaningful,
> >>> > > > > > > > >> >> and create time and log append time "should" be
> >>> similar
> >>> > on
> >>> > > > > > > unbuffered
> >>> > > > > > > > >> >> clusters so you can probably use either.
> >>> > > > > > > > >> >>
> >>> > > > > > > > >> >> Using log append time on buffered clusters actually
> >>> > results
> >>> > > > in
> >>> > > > > > bad
> >>> > > > > > > > >> things.
> >>> > > > > > > > >> >> If you request the offset for a given time you get
> >>> don't
> >>> > > end
> >>> > > > up
> >>> > > > > > > > getting
> >>> > > > > > > > >> >> data for that time but rather data that showed up
> at
> >>> that
> >>> > > > time.
> >>> > > > > > If
> >>> > > > > > > > you
> >>> > > > > > > > >> try
> >>> > > > > > > > >> >> to retain 7 days of data it may mostly work but any
> >>> kind
> >>> > of
> >>> > > > > > > > >> bootstrapping
> >>> > > > > > > > >> >> will result in retaining much more (potentially the
> >>> whole
> >>> > > > > > database
> >>> > > > > > > > >> >> contents!).
> >>> > > > > > > > >> >>
> >>> > > > > > > > >> >> So what I am suggesting in terms of the use of the
> >>> > > > > > max.append.delay
> >>> > > > > > > > is
> >>> > > > > > > > >> >> that
> >>> > > > > > > > >> >> unbuffered clusters would have this set and
> buffered
> >>> > > clusters
> >>> > > > > > would
> >>> > > > > > > > not.
> >>> > > > > > > > >> >> In
> >>> > > > > > > > >> >> other words, in LI terminology, tracking and
> metrics
> >>> > > clusters
> >>> > > > > > would
> >>> > > > > > > > have
> >>> > > > > > > > >> >> this enforced, aggregate and replica clusters
> >>> wouldn't.
> >>> > > > > > > > >> >>
> >>> > > > > > > > >> >> So you DO have the issue of potentially maintaining
> >>> more
> >>> > > data
> >>> > > > > > than
> >>> > > > > > > > you
> >>> > > > > > > > >> >> need
> >>> > > > > > > > >> >> to on aggregate clusters if your mirroring skews,
> >>> but you
> >>> > > > DON'T
> >>> > > > > > > need
> >>> > > > > > > > to
> >>> > > > > > > > >> >> tweak the setting as you described.
> >>> > > > > > > > >> >>
> >>> > > > > > > > >> >> 2. Let's say in the above scenario we let the
> >>> messages
> >>> > in,
> >>> > > at
> >>> > > > > > that
> >>> > > > > > > > point
> >>> > > > > > > > >> >> > some log segments in the target cluster might
> have
> >>> a
> >>> > wide
> >>> > > > > range
> >>> > > > > > > of
> >>> > > > > > > > >> >> > timestamps, like Guozhang mentioned the log
> rolling
> >>> > could
> >>> > > > be
> >>> > > > > > > tricky
> >>> > > > > > > > >> >> because
> >>> > > > > > > > >> >> > the first time index entry does not necessarily
> >>> have
> >>> > the
> >>> > > > > > smallest
> >>> > > > > > > > >> >> timestamp
> >>> > > > > > > > >> >> > of all the messages in the log segment. Instead,
> >>> it is
> >>> > > the
> >>> > > > > > > largest
> >>> > > > > > > > >> >> > timestamp ever seen. We have to scan the entire
> >>> log to
> >>> > > find
> >>> > > > > the
> >>> > > > > > > > >> message
> >>> > > > > > > > >> >> > with smallest offset to see if we should roll.
> >>> > > > > > > > >> >>
> >>> > > > > > > > >> >>
> >>> > > > > > > > >> >> I think there are two uses for time-based log
> >>> rolling:
> >>> > > > > > > > >> >> 1. Making the offset lookup by timestamp work
> >>> > > > > > > > >> >> 2. Ensuring we don't retain data indefinitely if it
> >>> is
> >>> > > > supposed
> >>> > > > > > to
> >>> > > > > > > > get
> >>> > > > > > > > >> >> purged after 7 days
> >>> > > > > > > > >> >>
> >>> > > > > > > > >> >> But think about these two use cases. (1) is totally
> >>> > > obviated
> >>> > > > by
> >>> > > > > > the
> >>> > > > > > > > >> >> time=>offset index we are adding which yields much
> >>> more
> >>> > > > > granular
> >>> > > > > > > > offset
> >>> > > > > > > > >> >> lookups. (2) Is actually totally broken if you
> >>> switch to
> >>> > > > append
> >>> > > > > > > time,
> >>> > > > > > > > >> >> right? If you want to be sure for security/privacy
> >>> > reasons
> >>> > > > you
> >>> > > > > > only
> >>> > > > > > > > >> retain
> >>> > > > > > > > >> >> 7 days of data then if the log append and create
> time
> >>> > > diverge
> >>> > > > > you
> >>> > > > > > > > >> actually
> >>> > > > > > > > >> >> violate this requirement.
> >>> > > > > > > > >> >>
> >>> > > > > > > > >> >> I think 95% of people care about (1) which is
> solved
> >>> in
> >>> > the
> >>> > > > > > > proposal
> >>> > > > > > > > and
> >>> > > > > > > > >> >> (2) is actually broken today as well as in both
> >>> > proposals.
> >>> > > > > > > > >> >>
> >>> > > > > > > > >> >> 3. Theoretically it is possible that an older log
> >>> segment
> >>> > > > > > contains
> >>> > > > > > > > >> >> > timestamps that are older than all the messages
> in
> >>> a
> >>> > > newer
> >>> > > > > log
> >>> > > > > > > > >> segment.
> >>> > > > > > > > >> >> It
> >>> > > > > > > > >> >> > would be weird that we are supposed to delete the
> >>> newer
> >>> > > log
> >>> > > > > > > segment
> >>> > > > > > > > >> >> before
> >>> > > > > > > > >> >> > we delete the older log segment.
> >>> > > > > > > > >> >>
> >>> > > > > > > > >> >>
> >>> > > > > > > > >> >> The index timestamps would always be a lower bound
> >>> (i.e.
> >>> > > the
> >>> > > > > > > maximum
> >>> > > > > > > > at
> >>> > > > > > > > >> >> that time) so I don't think that is possible.
> >>> > > > > > > > >> >>
> >>> > > > > > > > >> >>  4. In bootstrap case, if we reload the data to a
> >>> Kafka
> >>> > > > > cluster,
> >>> > > > > > we
> >>> > > > > > > > have
> >>> > > > > > > > >> >> to
> >>> > > > > > > > >> >> > make sure we configure the topic correctly before
> >>> we
> >>> > load
> >>> > > > the
> >>> > > > > > > data.
> >>> > > > > > > > >> >> > Otherwise the message might either be rejected
> >>> because
> >>> > > the
> >>> > > > > > > > timestamp
> >>> > > > > > > > >> is
> >>> > > > > > > > >> >> too
> >>> > > > > > > > >> >> > old, or it might be deleted immediately because
> the
> >>> > > > retention
> >>> > > > > > > time
> >>> > > > > > > > has
> >>> > > > > > > > >> >> > reached.
> >>> > > > > > > > >> >>
> >>> > > > > > > > >> >>
> >>> > > > > > > > >> >> See (1).
> >>> > > > > > > > >> >>
> >>> > > > > > > > >> >> -Jay
> >>> > > > > > > > >> >>
> >>> > > > > > > > >> >> On Tue, Oct 13, 2015 at 7:30 PM, Jiangjie Qin
> >>> > > > > > > > <j...@linkedin.com.invalid
> >>> > > > > > > > >> >
> >>> > > > > > > > >> >> wrote:
> >>> > > > > > > > >> >>
> >>> > > > > > > > >> >> > Hey Jay and Guozhang,
> >>> > > > > > > > >> >> >
> >>> > > > > > > > >> >> > Thanks a lot for the reply. So if I understand
> >>> > correctly,
> >>> > > > > Jay's
> >>> > > > > > > > >> proposal
> >>> > > > > > > > >> >> > is:
> >>> > > > > > > > >> >> >
> >>> > > > > > > > >> >> > 1. Let client stamp the message create time.
> >>> > > > > > > > >> >> > 2. Broker build index based on client-stamped
> >>> message
> >>> > > > create
> >>> > > > > > > time.
> >>> > > > > > > > >> >> > 3. Broker only takes message whose create time is
> >>> > withing
> >>> > > > > > current
> >>> > > > > > > > time
> >>> > > > > > > > >> >> > plus/minus T (T is a configuration
> >>> *max.append.delay*,
> >>> > > > could
> >>> > > > > be
> >>> > > > > > > > topic
> >>> > > > > > > > >> >> level
> >>> > > > > > > > >> >> > configuration), if the timestamp is out of this
> >>> range,
> >>> > > > broker
> >>> > > > > > > > rejects
> >>> > > > > > > > >> >> the
> >>> > > > > > > > >> >> > message.
> >>> > > > > > > > >> >> > 4. Because the create time of messages can be out
> >>> of
> >>> > > order,
> >>> > > > > > when
> >>> > > > > > > > >> broker
> >>> > > > > > > > >> >> > builds the time based index it only provides the
> >>> > > guarantee
> >>> > > > > that
> >>> > > > > > > if
> >>> > > > > > > > a
> >>> > > > > > > > >> >> > consumer starts consuming from the offset
> returned
> >>> by
> >>> > > > > searching
> >>> > > > > > > by
> >>> > > > > > > > >> >> > timestamp t, they will not miss any message
> created
> >>> > after
> >>> > > > t,
> >>> > > > > > but
> >>> > > > > > > > might
> >>> > > > > > > > >> >> see
> >>> > > > > > > > >> >> > some messages created before t.
> >>> > > > > > > > >> >> >
> >>> > > > > > > > >> >> > To build the time based index, every time when a
> >>> broker
> >>> > > > needs
> >>> > > > > > to
> >>> > > > > > > > >> insert
> >>> > > > > > > > >> >> a
> >>> > > > > > > > >> >> > new time index entry, the entry would be
> >>> > > > > > > > {Largest_Timestamp_Ever_Seen
> >>> > > > > > > > >> ->
> >>> > > > > > > > >> >> > Current_Offset}. This basically means any
> timestamp
> >>> > > larger
> >>> > > > > than
> >>> > > > > > > the
> >>> > > > > > > > >> >> > Largest_Timestamp_Ever_Seen must come after this
> >>> offset
> >>> > > > > because
> >>> > > > > > > it
> >>> > > > > > > > >> never
> >>> > > > > > > > >> >> > saw them before. So we don't miss any message
> with
> >>> > larger
> >>> > > > > > > > timestamp.
> >>> > > > > > > > >> >> >
> >>> > > > > > > > >> >> > (@Guozhang, in this case, for log retention we
> only
> >>> > need
> >>> > > to
> >>> > > > > > take
> >>> > > > > > > a
> >>> > > > > > > > >> look
> >>> > > > > > > > >> >> at
> >>> > > > > > > > >> >> > the last time index entry, because it must be the
> >>> > largest
> >>> > > > > > > timestamp
> >>> > > > > > > > >> >> ever,
> >>> > > > > > > > >> >> > if that timestamp is overdue, we can safely
> delete
> >>> any
> >>> > > log
> >>> > > > > > > segment
> >>> > > > > > > > >> >> before
> >>> > > > > > > > >> >> > that. So we don't need to scan the log segment
> >>> file for
> >>> > > log
> >>> > > > > > > > retention)
> >>> > > > > > > > >> >> >
> >>> > > > > > > > >> >> > I assume that we are still going to have the new
> >>> > > > FetchRequest
> >>> > > > > > to
> >>> > > > > > > > allow
> >>> > > > > > > > >> >> the
> >>> > > > > > > > >> >> > time index replication for replicas.
> >>> > > > > > > > >> >> >
> >>> > > > > > > > >> >> > I think Jay's main point here is that we don't
> >>> want to
> >>> > > have
> >>> > > > > two
> >>> > > > > > > > >> >> timestamp
> >>> > > > > > > > >> >> > concepts in Kafka, which I agree is a reasonable
> >>> > concern.
> >>> > > > > And I
> >>> > > > > > > > also
> >>> > > > > > > > >> >> agree
> >>> > > > > > > > >> >> > that create time is more meaningful than
> >>> LogAppendTime
> >>> > > for
> >>> > > > > > users.
> >>> > > > > > > > But
> >>> > > > > > > > >> I
> >>> > > > > > > > >> >> am
> >>> > > > > > > > >> >> > not sure if making everything base on Create Time
> >>> would
> >>> > > > work
> >>> > > > > in
> >>> > > > > > > all
> >>> > > > > > > > >> >> cases.
> >>> > > > > > > > >> >> > Here are my questions about this approach:
> >>> > > > > > > > >> >> >
> >>> > > > > > > > >> >> > 1. Let's say we have two source clusters that are
> >>> > > mirrored
> >>> > > > to
> >>> > > > > > the
> >>> > > > > > > > same
> >>> > > > > > > > >> >> > target cluster. For some reason one of the mirror
> >>> maker
> >>> > > > from
> >>> > > > > a
> >>> > > > > > > > cluster
> >>> > > > > > > > >> >> dies
> >>> > > > > > > > >> >> > and after fix the issue we want to resume
> >>> mirroring. In
> >>> > > > this
> >>> > > > > > case
> >>> > > > > > > > it
> >>> > > > > > > > >> is
> >>> > > > > > > > >> >> > possible that when the mirror maker resumes
> >>> mirroring,
> >>> > > the
> >>> > > > > > > > timestamp
> >>> > > > > > > > >> of
> >>> > > > > > > > >> >> the
> >>> > > > > > > > >> >> > messages have already gone beyond the acceptable
> >>> > > timestamp
> >>> > > > > > range
> >>> > > > > > > on
> >>> > > > > > > > >> >> broker.
> >>> > > > > > > > >> >> > In order to let those messages go through, we
> have
> >>> to
> >>> > > bump
> >>> > > > up
> >>> > > > > > the
> >>> > > > > > > > >> >> > *max.append.delay
> >>> > > > > > > > >> >> > *for all the topics on the target broker. This
> >>> could be
> >>> > > > > > painful.
> >>> > > > > > > > >> >> >
> >>> > > > > > > > >> >> > 2. Let's say in the above scenario we let the
> >>> messages
> >>> > > in,
> >>> > > > at
> >>> > > > > > > that
> >>> > > > > > > > >> point
> >>> > > > > > > > >> >> > some log segments in the target cluster might
> have
> >>> a
> >>> > wide
> >>> > > > > range
> >>> > > > > > > of
> >>> > > > > > > > >> >> > timestamps, like Guozhang mentioned the log
> rolling
> >>> > could
> >>> > > > be
> >>> > > > > > > tricky
> >>> > > > > > > > >> >> because
> >>> > > > > > > > >> >> > the first time index entry does not necessarily
> >>> have
> >>> > the
> >>> > > > > > smallest
> >>> > > > > > > > >> >> timestamp
> >>> > > > > > > > >> >> > of all the messages in the log segment. Instead,
> >>> it is
> >>> > > the
> >>> > > > > > > largest
> >>> > > > > > > > >> >> > timestamp ever seen. We have to scan the entire
> >>> log to
> >>> > > find
> >>> > > > > the
> >>> > > > > > > > >> message
> >>> > > > > > > > >> >> > with smallest offset to see if we should roll.
> >>> > > > > > > > >> >> >
> >>> > > > > > > > >> >> > 3. Theoretically it is possible that an older log
> >>> > segment
> >>> > > > > > > contains
> >>> > > > > > > > >> >> > timestamps that are older than all the messages
> in
> >>> a
> >>> > > newer
> >>> > > > > log
> >>> > > > > > > > >> segment.
> >>> > > > > > > > >> >> It
> >>> > > > > > > > >> >> > would be weird that we are supposed to delete the
> >>> newer
> >>> > > log
> >>> > > > > > > segment
> >>> > > > > > > > >> >> before
> >>> > > > > > > > >> >> > we delete the older log segment.
> >>> > > > > > > > >> >> >
> >>> > > > > > > > >> >> > 4. In bootstrap case, if we reload the data to a
> >>> Kafka
> >>> > > > > cluster,
> >>> > > > > > > we
> >>> > > > > > > > >> have
> >>> > > > > > > > >> >> to
> >>> > > > > > > > >> >> > make sure we configure the topic correctly before
> >>> we
> >>> > load
> >>> > > > the
> >>> > > > > > > data.
> >>> > > > > > > > >> >> > Otherwise the message might either be rejected
> >>> because
> >>> > > the
> >>> > > > > > > > timestamp
> >>> > > > > > > > >> is
> >>> > > > > > > > >> >> too
> >>> > > > > > > > >> >> > old, or it might be deleted immediately because
> the
> >>> > > > retention
> >>> > > > > > > time
> >>> > > > > > > > has
> >>> > > > > > > > >> >> > reached.
> >>> > > > > > > > >> >> >
> >>> > > > > > > > >> >> > I am very concerned about the operational
> overhead
> >>> and
> >>> > > the
> >>> > > > > > > > ambiguity
> >>> > > > > > > > >> of
> >>> > > > > > > > >> >> > guarantees we introduce if we purely rely on
> >>> > CreateTime.
> >>> > > > > > > > >> >> >
> >>> > > > > > > > >> >> > It looks to me that the biggest issue of adopting
> >>> > > > CreateTime
> >>> > > > > > > > >> everywhere
> >>> > > > > > > > >> >> is
> >>> > > > > > > > >> >> > CreateTime can have big gaps. These gaps could be
> >>> > caused
> >>> > > by
> >>> > > > > > > several
> >>> > > > > > > > >> >> cases:
> >>> > > > > > > > >> >> > [1]. Faulty clients
> >>> > > > > > > > >> >> > [2]. Natural delays from different source
> >>> > > > > > > > >> >> > [3]. Bootstrap
> >>> > > > > > > > >> >> > [4]. Failure recovery
> >>> > > > > > > > >> >> >
> >>> > > > > > > > >> >> > Jay's alternative proposal solves [1], perhaps
> >>> solve
> >>> > [2]
> >>> > > as
> >>> > > > > > well
> >>> > > > > > > > if we
> >>> > > > > > > > >> >> are
> >>> > > > > > > > >> >> > able to set a reasonable max.append.delay. But it
> >>> does
> >>> > > not
> >>> > > > > seem
> >>> > > > > > > > >> address
> >>> > > > > > > > >> >> [3]
> >>> > > > > > > > >> >> > and [4]. I actually doubt if [3] and [4] are
> >>> solvable
> >>> > > > because
> >>> > > > > > it
> >>> > > > > > > > looks
> >>> > > > > > > > >> >> the
> >>> > > > > > > > >> >> > CreateTime gap is unavoidable in those two cases.
> >>> > > > > > > > >> >> >
> >>> > > > > > > > >> >> > Thanks,
> >>> > > > > > > > >> >> >
> >>> > > > > > > > >> >> > Jiangjie (Becket) Qin
> >>> > > > > > > > >> >> >
> >>> > > > > > > > >> >> >
> >>> > > > > > > > >> >> > On Tue, Oct 13, 2015 at 3:23 PM, Guozhang Wang <
> >>> > > > > > > wangg...@gmail.com
> >>> > > > > > > > >
> >>> > > > > > > > >> >> wrote:
> >
> > Just to complete Jay's option, here is my understanding:
> >
> > 1. For log retention: if we want to remove data before time t, we look
> > into the index file of each segment and find the largest timestamp t' < t,
> > find the corresponding offset and start scanning to the end of the
> > segment; if there is no entry with timestamp >= t, we can delete this
> > segment. If a segment's smallest indexed timestamp is larger than t, we
> > can skip that segment.
> >
> > 2. For log rolling: if we want to start a new segment after time t, we
> > look into the active segment's index file. If the largest timestamp is
> > already > t, we can roll a new segment immediately; if it is < t, we read
> > its corresponding offset and start scanning to the end of the segment, and
> > if we find a record whose timestamp > t, we can roll a new segment.
> >
> > For log rolling we only need to possibly scan a small portion of the
> > active segment, which should be fine; for log retention we may in the
> > worst case end up scanning all segments, but in practice we may skip most
> > of them since their smallest timestamp in the index file is larger than t.
> >
> > Guozhang
> >
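A minimal sketch of the retention check in point 1 above, assuming each segment keeps a sparse timestamp-to-offset index whose indexed timestamps belong to real messages in that segment; the Segment and RetentionCheck classes and their method names are illustrative only, not Kafka's actual internals:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Illustrative model of a log segment with a sparse time index.
    class Segment {
        // Sparse index: indexed timestamp -> offset of a message carrying that timestamp.
        final NavigableMap<Long, Long> timeIndex = new TreeMap<>();
        // The segment's messages: offset -> timestamp.
        final NavigableMap<Long, Long> messages = new TreeMap<>();

        // True if no message in this segment has a timestamp >= cutoff,
        // i.e. everything in the segment is older than the retention cutoff.
        boolean deletableBefore(long cutoff) {
            if (messages.isEmpty()) return true;
            // Skip the scan if even the smallest indexed timestamp is >= cutoff:
            // the segment clearly still contains data that must be kept.
            if (!timeIndex.isEmpty() && timeIndex.firstKey() >= cutoff) return false;
            // Otherwise start scanning from the offset of the largest indexed
            // timestamp below the cutoff and look for anything newer.
            Map.Entry<Long, Long> floor = timeIndex.lowerEntry(cutoff);
            long startOffset = (floor == null) ? messages.firstKey() : floor.getValue();
            for (long ts : messages.tailMap(startOffset, true).values()) {
                if (ts >= cutoff) return false;
            }
            return true;
        }
    }

    class RetentionCheck {
        // Segments whose data is entirely older than time t and can be deleted.
        static List<Segment> deletableBefore(List<Segment> segments, long t) {
            List<Segment> deletable = new ArrayList<>();
            for (Segment s : segments) {
                if (s.deletableBefore(t)) deletable.add(s);
            }
            return deletable;
        }
    }

The early return mirrors the shortcut described above: a segment whose smallest indexed timestamp already exceeds the cutoff never needs to be scanned, so in practice most segments are skipped.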
> >
> > On Tue, Oct 13, 2015 at 12:52 AM, Jay Kreps <j...@confluent.io> wrote:
> >
> > > I think it should be possible to index out-of-order timestamps. The
> > > timestamp index would be similar to the offset index, a memory-mapped
> > > file appended to as part of the log append, but would have the format
> > >   timestamp offset
> > > The timestamp entries would be monotonic and, as with the offset index,
> > > would be no more often than every 4k (or some configurable threshold to
> > > keep the index small--actually for timestamp it could probably be much
> > > more sparse than 4k).
> > >
> > > A search for a timestamp t yields an offset o before which no prior
> > > message has a timestamp >= t. In other words, if you read the log
> > > starting with o you are guaranteed not to miss any messages occurring at
> > > t or later, though you may get many before t (due to out-of-orderness).
> > > Unlike the offset index this bound doesn't really have to be tight (i.e.
> > > probably no need to go search the log itself, though you could).
> > >
> > > -Jay
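A minimal sketch of the sparse, monotonic timestamp index described above, with the lookup guarantee that a reader starting at the returned offset misses nothing at or after the searched time; the entry layout, interval constant, and names are assumptions for illustration, not Kafka's actual index format:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sparse time index: append-only entries of (maxTimestampSoFar, offset).
    class SparseTimeIndex {
        private static final long INDEX_INTERVAL_BYTES = 4096;  // "no more often than every 4k"

        private final List<long[]> entries = new ArrayList<>(); // {maxTimestampSoFar, offset}
        private long maxTimestampSoFar = Long.MIN_VALUE;
        private long bytesSinceLastEntry = 0;

        // Called as part of every log append. Because each entry records the running
        // maximum timestamp seen so far, the indexed timestamps are monotonically
        // non-decreasing even if individual message timestamps arrive out of order.
        void onAppend(long timestamp, long offset, int messageSizeBytes) {
            maxTimestampSoFar = Math.max(maxTimestampSoFar, timestamp);
            bytesSinceLastEntry += messageSizeBytes;
            if (bytesSinceLastEntry >= INDEX_INTERVAL_BYTES) {
                entries.add(new long[] {maxTimestampSoFar, offset});
                bytesSinceLastEntry = 0;
            }
        }

        // Returns an offset o such that every message before o has a timestamp < t.
        // A consumer starting at o is guaranteed not to miss anything occurring at
        // or after t, though it may still see some earlier-timestamped messages.
        long lookup(long t, long logStartOffset) {
            long result = logStartOffset;
            for (long[] entry : entries) {
                if (entry[0] < t) {
                    result = entry[1];   // everything up to this offset is older than t
                } else {
                    break;               // entries are monotonic, nothing later qualifies
                }
            }
            return result;
        }
    }

Recording the running maximum rather than each raw timestamp is what keeps the entries monotonic and makes the simple floor-style search safe in the presence of out-of-order creation times.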
> > >
> > > On Tue, Oct 13, 2015 at 12:32 AM, Jay Kreps <j...@confluent.io> wrote:
> > >
> > > > Here's my basic take:
> > > > - I agree it would be nice to have a notion of time baked in if it
> > > > were done right
> > > > - All the proposals so far seem pretty complex--I think they might
> > > > make things worse rather than better overall
> > > > - I think adding 2x8 byte timestamps to the message is probably a
> > > > non-starter from a size perspective
> > > > - Even if it isn't in the message, having two notions of time that
> > > > control different things is a bit confusing
> > > > - The mechanics of basing retention etc. on log append time when
> > > > that's not in the log seem complicated
> > > >
> > > > To that end here is a possible 4th option. Let me know what you think.
> > > >
> > > > The basic idea is that the message creation time is closest to what
> > > > the user actually cares about but is dangerous if set wrong. So rather
> > > > than substitute another notion of time, let's try to ensure the
> > > > correctness of message creation time by preventing arbitrarily bad
> > > > message creation times.
> > > >
> > > > First, let's see if we can agree that log append time is not something
> > > > anyone really cares about but rather an implementation detail. The
> > > > timestamp that matters to the user is when the message occurred (the
> > > > creation time). The log append time is basically just an approximation
> > > > to this, on the assumption that the message creation and the message
> > > > receipt on the server occur pretty close together.
> > > >
> > > > But as these values diverge the issue starts to become apparent. Say
> > > > you set the retention to one week and then mirror data from a topic
> > > > containing two years of retention. Your intention is clearly to keep
> > > > the last week, but because the mirroring is appending right now you
> > > > will keep two years.
> > > >
> > > > The reason we are liking log append time is because we are
> > > > (justifiably) concerned that in certain situations the creation time
> > > > may not be trustworthy. This same problem exists on the servers, but
> > > > there are fewer servers and they just run the Kafka code, so it is
> > > > less of an issue.
> > > >
> > > > There are two possible ways to handle this:
> > > >
> > > >    1. Just tell people to add size-based retention. I think this is
> > > >    not entirely unreasonable; we're basically saying we retain data
> > > >    based on the timestamp you give us in the data. If you give us bad
> > > >    data we will retain it for a bad amount of time. If you want to
> > > >    ensure we don't retain "too much" data, define "too much" by
> > > >    setting a size-based retention setting. This is not entirely
> > > >    unreasonable but kind of suffers from a "one bad apple" problem in
> > > >    a very large environment.
> > > >    2. Prevent bad timestamps. In general we can't say a timestamp is
> > > >    bad. However, the definition we're implicitly using is that we
> > > >    think there is a set of topics/clusters where the creation
> > > >    timestamp should always be "very close" to the log append
> > > >    timestamp. This is true for data sources that have no buffering
> > > >    capability (which at LinkedIn is very common, but is more rare
> > > >    elsewhere). The solution in this case would be to allow a setting
> > > >    along the lines of max.append.delay which checks the creation
> > > >    timestamp against the server time to look for too large a
> > > >    divergence. The solution would either be to reject the message or
> > > >    to override it with the server time.
> > > >
> > > > So in LI's environment you would configure the clusters used for
> > > > direct, unbuffered message production (e.g. tracking and metrics
> > > > local) to enforce a reasonably aggressive timestamp bound (say 10
> > > > mins), and all other clusters would just inherit these.
> > > >
> > > > The downside of this approach is requiring the special configuration.
> > > > However, I think in 99% of environments this could be skipped
> > > > entirely; it's only when the ratio of clients to servers gets so
> > > > massive that you need to do this. The primary upside is that you have
> > > > a single authoritative notion of time which is closest to what a user
> > > > would want and is stored directly in the message.
> > > >
> > > > I'm also assuming there is a workable approach for indexing
> > > > non-monotonic timestamps, though I haven't actually worked that out.
> > > >
> > > > -Jay
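A minimal sketch of the max.append.delay check from option 2 above: the broker compares the client-supplied creation time with its own clock and, on too large a divergence, either rejects the message or overrides the timestamp with server time. The config name comes from the proposal; the class, the policy names, the symmetric divergence rule, and the exception choice are illustrative assumptions:

    // Illustrative broker-side check of client CreateTime against the server clock.
    class TimestampValidator {
        enum Policy { REJECT, OVERRIDE_WITH_SERVER_TIME }

        private final long maxAppendDelayMs;   // e.g. 10 minutes for unbuffered producers
        private final Policy policy;

        TimestampValidator(long maxAppendDelayMs, Policy policy) {
            this.maxAppendDelayMs = maxAppendDelayMs;
            this.policy = policy;
        }

        // Returns the timestamp the broker should append with, or throws if the
        // divergence is too large and the policy is REJECT.
        long validate(long createTimeMs, long serverTimeMs) {
            long divergenceMs = Math.abs(serverTimeMs - createTimeMs);
            if (divergenceMs <= maxAppendDelayMs) {
                return createTimeMs;            // creation time looks sane, trust it
            }
            if (policy == Policy.OVERRIDE_WITH_SERVER_TIME) {
                return serverTimeMs;            // keep the message, fix its timestamp
            }
            throw new IllegalArgumentException(
                "CreateTime diverges from broker time by " + divergenceMs
                + " ms, more than max.append.delay=" + maxAppendDelayMs + " ms");
        }
    }

A tracking or metrics cluster fed directly by unbuffered producers might set the bound to something like 10 minutes as suggested above, while clusters receiving mirrored data would leave the check off, since their data is legitimately old.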
> > > >
> > > > On Mon, Oct 5, 2015 at 8:52 PM, Jiangjie Qin <j...@linkedin.com.invalid>
> > > > wrote:
> > > >
> > > > > Bumping up this thread, although most of the discussion was on the
> > > > > discussion thread of KIP-31 :)
> > > > >
> > > > > I just updated the KIP page to add a detailed solution for the
> > > > > option (Option 3) that does not expose the LogAppendTime to the
> > > > > user.
> > > > >
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-32+-+Add+CreateTime+and+LogAppendTime+to+Kafka+message
> > > > >
> > > > > The option has a minor change to the fetch request to allow fetching
> > > > > a time index entry as well. I kind of like this solution because it's
> > > > > just doing what we need without introducing other things.
> > > > >
> > > > > It will be great to see what the feedback is. I can explain more
> > > > > during tomorrow's KIP hangout.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jiangjie (Becket) Qin
> > > > >
> > > > > On Thu, Sep 10, 2015 at 2:47 PM, Jiangjie Qin <j...@linkedin.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Jay,
> > > > > >
> > > > > > I just copy/pasted here your feedback on the timestamp proposal
> > > > > > that was in the discussion thread of KIP-31. Please see the
> > > > > > replies inline. The main change I made compared with the previous
> > > > > > proposal is to add both CreateTime and LogAppendTime to the
> > > > > > message.
> > > > > >
> > > > > > On Tue, Sep 8, 2015 at 10:57 AM, Jay Kreps <j...@confluent.io> wrote:
> > > > > >
> > > > > > > Hey Becket,
> > > > > > >
> > > > > > > I was proposing splitting up the KIP just for simplicity of
> > > > > > > discussion. You can still implement them in one patch. I think
> > > > > > > otherwise it will be hard to discuss/vote on them, since if you
> > > > > > > like the offset proposal but not the time proposal what do you
> > > > > > > do?
> > > > > > >
> > > > > > > Introducing a second notion of time into Kafka is a pretty
> > > > > > > massive philosophical change, so it kind of warrants its own
> > > > > > > KIP; I think it isn't just "Change message format".
> > > > > > >
> > > > > > > WRT time, I think one thing to clarify in the proposal is how MM
> > > > > > > will have access to set the timestamp? Presumably this will be a
> > > > > > > new field in ProducerRecord, right? If so then any user can set
> > > > > > > the timestamp, right? I'm not sure you answered the questions
> > > > > > > around how this will work for MM, since when MM retains
> > > > > > > timestamps from multiple partitions they will then be out of
> > > > > > > order and in the past (so the max(lastAppendedTimestamp,
> > > > > > > currentTimeMillis) override you proposed will not work,
> > > > > > > right?). If we don't do this then when you set up mirroring the
> > > > > > > data will all be new and you have the same retention problem you
> > > > > > > described. Maybe I missed something...?
> > > > > > lastAppendedTimestamp means the timestamp of the message that was
> > > > > > last appended to the log.
> > > > > > If a broker is a leader, since it will assign the timestamp by
> > > > > > itself, the lastAppendedTimestamp will be its local clock when it
> > > > > > appended the last message. So if there is no leader migration,
> > > > > > max(lastAppendedTimestamp, currentTimeMillis) = currentTimeMillis.
> > > > > > If a broker is a follower, because it will keep the leader's
> > > > > > timestamp unchanged, the lastAppendedTime would be the leader's
> > > > > > clock when it appended that message. It keeps track of the
> > > > > > lastAppendedTime only in case it becomes leader later on. At that
> > > > > > point, it is possible that the timestamp of the last appended
> > > > > > message was stamped by the old leader, but the new leader's
> > > > > > currentTimeMillis < lastAppendedTime. If a new message comes,
> > > > > > instead of stamping it with the new leader's currentTimeMillis, we
> > > > > > have to stamp it with lastAppendedTime to avoid the timestamp in
> > > > > > the log going backward.
> > > > > > The max(lastAppendedTimestamp, currentTimeMillis) is purely based
> > > > > > on the broker side clock. If MM produces messages with different
> > > > > > LogAppendTimes from source clusters to the same target cluster,
> > > > > > the LogAppendTime will be ignored and re-stamped by the target
> > > > > > cluster.
> > > > > > I added a use case example for mirror maker in KIP-32. Also there
> > > > > > is a corner case discussion about when we need the
> > > > > > max(lastAppendedTime, currentTimeMillis) trick. Could you take a
> > > > > > look and see if that answers your question?
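A minimal sketch of the stamping rule described above, which keeps LogAppendTime non-decreasing across leader changes; the class and method names are illustrative, not the actual broker code:

    // Illustrative leader-side stamping of LogAppendTime with
    // max(lastAppendedTimestamp, currentTimeMillis).
    class LogAppendTimeStamper {
        // Timestamp of the last message appended to this log. A follower keeps it
        // updated from the leader's stamps so it is correct if it becomes the leader.
        private long lastAppendedTimestamp = Long.MIN_VALUE;

        // Called by the leader for each message it appends.
        synchronized long stamp(long currentTimeMillis) {
            // If the previous leader's clock was ahead of ours, reuse its timestamp
            // rather than letting timestamps in the log go backward.
            lastAppendedTimestamp = Math.max(lastAppendedTimestamp, currentTimeMillis);
            return lastAppendedTimestamp;
        }

        // Called by a follower for each leader-stamped message it copies unchanged.
        synchronized void observe(long leaderTimestamp) {
            lastAppendedTimestamp = Math.max(lastAppendedTimestamp, leaderTimestamp);
        }
    }

In the common case of no leader migration the broker's clock is already at or ahead of lastAppendedTimestamp, so stamp(System.currentTimeMillis()) simply returns the current time.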
> > > > > > >
> > > > > > > My main motivation is that given that both Samza and Kafka
> > > > > > > streams are doing work that implies a mandatory client-defined
> > > > > > > notion of time, I really think introducing a different mandatory
> > > > > > > notion of time in Kafka is going to be quite odd. We should
> > > > > > > think hard about how client-defined time could work. I'm not
> > > > > > > sure if it can, but I'm also not sure that it can't. Having both
> > > > > > > will be odd. Did you chat about this with Yi/Kartik on the Samza
> > > > > > > side?
> > > > > > I talked with Kartik and realized that it would be useful to have
> > > > > > a client timestamp to facilitate use cases like stream processing.
> > > > > > I was trying to figure out if we can simply use the client
> > > > > > timestamp without introducing the server time. There is some
> > > > > > discussion in the KIP.
> > > > > > The key problems we want to solve here are:
> > > > > > 1. We want log retention and rolling to depend on the server clock.
> > > > > > 2. We want to make sure the log-associated timestamp is retained
> > > > > > when replicas move.
> > > > > > 3. We want to use the timestamp in some way that allows searching
> > > > > > by timestamp.
> > > > > > For 1 and 2, an alternative is to pass the log-associated
> > > > > > timestamp through replication, but that means we need a different
> > > > > > protocol for replica fetching to pass the log-associated
> > > > > > timestamp. It is actually complicated and there could be a lot of
> > > > > > corner cases to handle, e.g. if an old leader started to fetch
> > > > > > from the new leader, should it also update all of its old log
> > > > > > segments' timestamps?
> > > > > > I think a client side timestamp would actually be better for 3 if
> > > > > > we can find a way to make it work.
> > > > > > So far I am not able to convince myself that only having a client
> > > > > > side timestamp would work, mainly because of 1 and 2. There are a
> > > > > > few situations I mentioned in the KIP.
> > > > > > >
> > > > > > > When you are saying it won't work you are assuming some
> > > > > > > particular implementation? Maybe that the index is a
> > > > > > > monotonically increasing set of pointers to the least record
> > > > > > > with a timestamp larger than the index time? In other words a
> > > > > > > search for time X gives the largest offset at which all records
> > > > > > > are <= X?
> > > > > > It is a promising idea. We probably can have an in-memory index
> > > > > > like that, but it might be complicated to have a file on disk like
> > > > > > that. Imagine there are two timestamps T0 < T1. We see message Y
> > > > > > created at T1 and create an index like [T1->Y]; then we see
> > > > > > message X created at T0, and supposedly we should have the index
> > > > > > look like [T0->X, T1->Y]. That is easy to do in memory, but we
> > > > > > might have to rewrite the index file completely. Maybe we can have
> > > > > > the first entry with timestamp 0, and only update the first
> > > > > > pointer for any out-of-range timestamp, so the index will be
> > > > > > [0->X, T1->Y]. Also, the ranges of timestamps in the log segments
> > > > > > can overlap with each other. That means we either need to keep a
> > > > > > cross-segment index file or we need to check the index file of
> > > > > > each log segment.
> > > > > > I separated out the time-based log index to KIP-33 because it can
> > > > > > be an independent follow-up feature, as Neha suggested. I will try
> > > > > > to make the time-based index work with the client side timestamp.
> > > > > > >
> > > > > > > For retention, I agree with the problem you point out, but I
> > > > > > > think what you are saying in that case is that you want a size
> > > > > > > limit too. If you use system time you actually hit the same
> > > > > > > problem: say you do a full dump of a DB table with a setting of
> > > > > > > 7 days retention, your retention will actually not get enforced
> > > > > > > for the first 7 days because the data is "new to Kafka".
> > > > > > I kind of think the size limit here is orthogonal. It is a valid
> > > > > > use case where people want to use time-based retention only. In
> > > > > > your example, depending on the client timestamp might break the
> > > > > > functionality - say it is a bootstrap case where people actually
> > > > > > need to read all the data. If we depend on the client timestamp,
> > > > > > the data might be deleted instantly when it comes to the broker.
> > > > > > It might be too demanding to expect the broker to understand what
> > > > > > people actually want to do with the data coming in. So the
> > > > > > guarantee of using the server side timestamp is that "after being
> > > > > > appended to the log, all messages will be available on the broker
> > > > > > for the retention time", which is not changeable by clients.
> > > > > > >
> > > > > > > -Jay
> > > > > >
> > > > > > On Thu, Sep 10, 2015 at 12:55 PM, Jiangjie Qin <j...@linkedin.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi folks,
> > > > > > >
> > > > > > > This proposal was previously in KIP-31 and we separated it to
> > > > > > > KIP-32 per Neha and Jay's suggestion.
> > > > > > >
> > > > > > > The proposal is to add the following two timestamps to the Kafka
> > > > > > > message:
> > > > > > > - CreateTime
> > > > > > > - LogAppendTime
> > > > > > >
> > > > > > > The CreateTime will be set by the producer and will not change
> > > > > > > after that. The LogAppendTime will be set by the broker for
> > > > > > > purposes such as enforcing log retention and log rolling.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jiangjie (Becket) Qin
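A minimal sketch of a record carrying the two proposed timestamps, just to make the split of responsibility concrete; the field and method names are made up for illustration and are not Kafka's actual message format:

    // Illustrative message carrying both of the proposed timestamps.
    class TimestampedMessage {
        static final long NO_TIMESTAMP = -1L;

        final long createTime;        // set by the producer and never changed afterwards
        private long logAppendTime;   // set by the broker when the message is appended

        TimestampedMessage(long createTime) {
            this.createTime = createTime;
            this.logAppendTime = NO_TIMESTAMP;   // unset until the broker stamps it
        }

        // Broker-side stamping on append; this is the timestamp the broker would
        // use for enforcing log retention and log rolling.
        void onAppend(long brokerTimeMillis) {
            this.logAppendTime = brokerTimeMillis;
        }

        long logAppendTime() {
            return logAppendTime;
        }
    }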
> >
> > --
> > -- Guozhang
>



-- 
-- Guozhang
