Our inconsistent usage of magic values like `-1` and `null` to represent missing values makes it hard to reason about code. We had a recent major regression in MirrorMaker as a result of this[1], for example. `null` by itself is hard enough (often called the billion dollar mistake), so can we decide what we prefer and stick with that going forward?
My take is: `Option`/`Optional` > `null` > `-1` unless it's a particularly performance sensitive area where the boxing incurred by having `null` as a possible value is a problem. Since we're still on Java 7, we can't use `Optional` in Java code. Ismael [1] The fix: https://github.com/apache/kafka/commit/4e4e2fb5085758ee9ccf6307433ad531a33198d3 On Fri, Sep 9, 2016 at 3:48 PM, Jun Rao <j...@confluent.io> wrote: > Jiangjie, > > Returning TimestampOffset(-1, -1) sounds reasonable to me. > > Thanks, > > Jun > > On Thu, Sep 8, 2016 at 8:46 PM, Becket Qin <becket....@gmail.com> wrote: > > > Hi Jun, > > > > > 1. latestOffsets() returns the next offset. So there won't be a > timestamp > > > associated with it. Would we use something like -1 for timestamp? > > > > The returned value would high watermark, so there might be an associated > > timestamp. But if it is log end offset, it seems that -1 is a reasonable > > value. > > > > > 2. Jason mentioned that if no message has timestamp >= the provided > > > timestamp, we return a null value for that partition. Could we document > > > that in the wiki? > > > > I made a minor change to return a TimestampOffset(-1, -1) in that case. > Not > > sure which on is better but there seems only minor difference. What do > you > > think? > > > > I haven't seen a planned release date yet, but I can probably get it done > > in 2-3 weeks with reasonable rounds of reviews. > > > > Thanks, > > > > Jiangjie (Becket) Qin > > > > > > On Thu, Sep 8, 2016 at 6:24 PM, Jun Rao <j...@confluent.io> wrote: > > > > > Hi, Jiangjie, > > > > > > Thanks for the updated KIP. A couple of minor comments. > > > > > > 1. latestOffsets() returns the next offset. So there won't be a > timestamp > > > associated with it. Would we use something like -1 for timestamp? > > > > > > 2. Jason mentioned that if no message has timestamp >= the provided > > > timestamp, we return a null value for that partition. Could we document > > > that in the wiki? > > > > > > BTW, we are getting close to the next release. This is a really nice > > > feature to have. Do you think you will have a patch ready for the next > > > release? > > > > > > Thanks. > > > > > > Jun > > > > > > On Wed, Sep 7, 2016 at 2:47 PM, Becket Qin <becket....@gmail.com> > wrote: > > > > > > > That sounds reasonable to me. I'll update the KIP wiki page. > > > > > > > > On Wed, Sep 7, 2016 at 1:34 PM, Jason Gustafson <ja...@confluent.io> > > > > wrote: > > > > > > > > > Hey Becket, > > > > > > > > > > I don't have a super strong preference, but I think this > > > > > > > > > > earliestOffset(singleton(partition)); > > > > > > > > > > captures the intent more clearly than this: > > > > > > > > > > offsetsForTimes(singletonMap(partition, -1)); > > > > > > > > > > I can understand the desire to keep the API footprint small, but I > > > think > > > > > the use case is common enough to justify separate APIs. A couple > > > > additional > > > > > points: > > > > > > > > > > 1. If we had separate methods, it might make sense to treat > negative > > > > > timestamps as illegal in offsetsForTimes. That seems safer from the > > > user > > > > > perspective since legitimate timestamps should always be positive. > > > > > 2. The expected behavior of offsetsForTimes is to return the > earliest > > > > > offset which is greater than or equal to the passed offset, so > having > > > > > Long.MAX_VALUE return the latest value doesn't seem very intuitive > to > > > > me. I > > > > > would actually expect it to return null. > > > > > > > > > > Given that, I think I prefer having the custom methods. What do you > > > > think? > > > > > > > > > > Thanks, > > > > > Jason > > > > > > > > > > On Wed, Sep 7, 2016 at 1:00 PM, Becket Qin <becket....@gmail.com> > > > wrote: > > > > > > > > > > > Hi Jason, > > > > > > > > > > > > Thanks for the feedback. That is a good point. For the -1 and -2 > > > > > semantics, > > > > > > I was just thinking we will preserve the semantics in the wire > > > > protocol. > > > > > > For the user facing API, I agree that is not intuitive. We can do > > one > > > > of > > > > > > the following: > > > > > > 1. Add two separate methods: earliestOffsets() and > latestOffsets(). > > > > > > 2. just have offsetsForTimes() and return the earliest if the > > > timestamp > > > > > is > > > > > > negative and the latest if the timestamp is Long.MAX_VALUE. > > > > > > > > > > > > The good thing about doing (1) is that we kind of have symmetric > > > > function > > > > > > signatures like seekToBeginning() and seekToEnd(). However, even > if > > > we > > > > do > > > > > > (1), we may still need to do (2) to handle the negative timestamp > > and > > > > the > > > > > > Long.MAX_VALUE timestamp in offsetsForTimes(). Then they > > essentially > > > > > become > > > > > > redundant to earliestOffsets() and latestOffsets(). > > > > > > > > > > > > Personally I prefer option (2) because of the conciseness and it > > > seems > > > > > > intuitive enough. But I am open to option (1) as well. > > > > > > > > > > > > Thanks, > > > > > > > > > > > > Jiangjie (Becket) Qin > > > > > > > > > > > > On Wed, Sep 7, 2016 at 11:06 AM, Jason Gustafson < > > ja...@confluent.io > > > > > > > > > > wrote: > > > > > > > > > > > > > Hey Becket, > > > > > > > > > > > > > > Thanks for the KIP. As I understand, the intention is to > preserve > > > the > > > > > > > current behavior with a timestamp of -1 indicating latest > > timestamp > > > > and > > > > > > -2 > > > > > > > indicating earliest timestamp. So users can query these offsets > > > using > > > > > the > > > > > > > offsetsForTimes API if they know the magic values. I'm > wondering > > if > > > > it > > > > > > > would make the usage a little nicer to have a separate API > > instead > > > > for > > > > > > > these special cases? Sort of in the way that we expose a > generic > > > > seek() > > > > > > and > > > > > > > a seekToBeginning(), maybe we could have an earliestOffset() in > > > > > addition > > > > > > to > > > > > > > offsetsForTimes()? > > > > > > > > > > > > > > Thanks, > > > > > > > Jason > > > > > > > > > > > > > > On Wed, Sep 7, 2016 at 10:04 AM, Becket Qin < > > becket....@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > > > Thanks everyone for all the feedback. > > > > > > > > > > > > > > > > If there is no further concerns or comments I will start a > > voting > > > > > > thread > > > > > > > on > > > > > > > > this KIP tomorrow. > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > Jiangjie (Becket) Qin > > > > > > > > > > > > > > > > On Tue, Sep 6, 2016 at 9:48 AM, Becket Qin < > > becket....@gmail.com > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hi Magnus, > > > > > > > > > > > > > > > > > > Thanks for the comments. I agree that querying messages > > within > > > a > > > > > time > > > > > > > > > range is a valid use case (actually this is an example use > > case > > > > in > > > > > my > > > > > > > > > previous email). The current proposal can achieve this by > > > having > > > > > two > > > > > > > > > ListOffsetRequest, right? I think the current API already > > > > supports > > > > > > the > > > > > > > > use > > > > > > > > > cases that require the offsets for multiple timestamps. The > > > > > question > > > > > > is > > > > > > > > > that whether it is worth adding more complexity to the > > protocol > > > > to > > > > > > make > > > > > > > > it > > > > > > > > > easier for multiple timestamp query. Personally I think > given > > > > that > > > > > > > query > > > > > > > > > multiple timestamps is likely an infrequent operation, > there > > is > > > > no > > > > > > need > > > > > > > > to > > > > > > > > > optimize for it and complicates the protocol. > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > Jiangjie (Becket) Qin > > > > > > > > > > > > > > > > > > On Mon, Sep 5, 2016 at 11:21 PM, Magnus Edenhill < > > > > > mag...@edenhill.se > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > >> Good write-up Qin, the API looks promising. > > > > > > > > >> > > > > > > > > >> I have one comment: > > > > > > > > >> > > > > > > > > >> 2016-09-03 5:20 GMT+02:00 Becket Qin < > becket....@gmail.com > > >: > > > > > > > > >> > > > > > > > > >> > The currently offsetsForTimes() API obviously does not > > > support > > > > > > > > querying > > > > > > > > >> > multiple timestamps for the same partition. It doesn't > > > seems a > > > > > > > feature > > > > > > > > >> for > > > > > > > > >> > ListOffsetRequest v0 either (sounds more like a bug). My > > > > > intuition > > > > > > > is > > > > > > > > >> that > > > > > > > > >> > it's a rare use case. Given it does not exist before and > > we > > > > > don't > > > > > > > see > > > > > > > > a > > > > > > > > >> > strong need from the community either, maybe it is > better > > to > > > > > keep > > > > > > it > > > > > > > > >> simple > > > > > > > > >> > for ListOffsetRequest v1. We can add it later if it > turns > > > out > > > > to > > > > > > be > > > > > > > a > > > > > > > > >> > useful feature (that may need a interface change, but I > > > > honestly > > > > > > do > > > > > > > > not > > > > > > > > >> > think people would frequently query many different > > > timestamps > > > > > for > > > > > > > the > > > > > > > > >> same > > > > > > > > >> > partition) > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> I argue that the current behaviour of OffsetRequest with > > > regards > > > > > to > > > > > > > > >> duplicate partitions is a bug > > > > > > > > >> and think it would be a mistake to move the same semantics > > > over > > > > to > > > > > > > thew > > > > > > > > >> new > > > > > > > > >> ListOffset API. > > > > > > > > >> One use case is that an application may want to know the > > > offset > > > > > > range > > > > > > > > >> between two timestamps, > > > > > > > > >> e.g., for reprocessing, batching, searching, etc. > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> Thanks, > > > > > > > > >> Magnus > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > > >> > Have a good long weekend! > > > > > > > > >> > > > > > > > > > >> > Thanks, > > > > > > > > >> > > > > > > > > > >> > Jiangjie (Becket) Qin > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > On Fri, Sep 2, 2016 at 6:10 PM, Ismael Juma < > > > > ism...@juma.me.uk> > > > > > > > > wrote: > > > > > > > > >> > > > > > > > > > >> > > Thanks for the proposal Becket. Looks good overall, a > > few > > > > > > > comments: > > > > > > > > >> > > > > > > > > > > >> > > ListOffsetResponse => [TopicName [PartitionOffsets]] > > > > > > > > >> > > > PartitionOffsets => Partition ErrorCode Timestamp > > > > [Offset] > > > > > > > > >> > > > Partition => int32 > > > > > > > > >> > > > ErrorCode => int16 > > > > > > > > >> > > > Timestamp => int64 > > > > > > > > >> > > > Offset => int > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > It should be int64 for `Offset` right? > > > > > > > > >> > > > > > > > > > > >> > > Implementation wise, we will migrate to > > > > o.a.k.common.requests. > > > > > > > > >> > > ListOffsetRequest > > > > > > > > >> > > > class on the broker side. > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > Could you clarify what you mean here? We already > > > > > > > > >> > > use o.a.k.common.requests.ListOffsetRequest in > > KafkaApis. > > > > > > > > >> > > > > > > > > > > >> > > long offset = consumer.offsetForTime( > > > > > Collections.singletonMap( > > > > > > > > >> > > topicPartition, > > > > > > > > >> > > > targetTime)).offset; > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > The result of `offsetForTime` is a Map, so we can't > just > > > > call > > > > > > > > >> `offset` on > > > > > > > > >> > > it. You probably meant something like: > > > > > > > > >> > > > > > > > > > > >> > > long offset = consumer.offsetForTime( > > > > > Collections.singletonMap( > > > > > > > > >> > > topicPartition, > > > > > > > > >> > > targetTime)).get(topicPartition).offset; > > > > > > > > >> > > > > > > > > > > >> > > Test searchByTimestamp with CreateTime and > LogAppendTime > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > Do you mean `Test offsetForTime`? > > > > > > > > >> > > > > > > > > > > >> > > And: > > > > > > > > >> > > > > > > > > > > >> > > 1. In KAFKA-1588, the following issue was described > > "When > > > > > > > performing > > > > > > > > >> an > > > > > > > > >> > > OffsetRequest, if you request the same topic and > > partition > > > > > > > > combination > > > > > > > > >> > in a > > > > > > > > >> > > single request more than once (for example, if you > want > > to > > > > get > > > > > > > both > > > > > > > > >> the > > > > > > > > >> > > head and tail offsets for a partition in the same > > > request), > > > > > you > > > > > > > will > > > > > > > > >> get > > > > > > > > >> > a > > > > > > > > >> > > response for both, but they will be the same offset". > > Will > > > > the > > > > > > new > > > > > > > > >> > request > > > > > > > > >> > > version support the use case where multiple timestamps > > are > > > > > > passed > > > > > > > > for > > > > > > > > >> the > > > > > > > > >> > > same topic partition? And if we do support it at the > > > > protocol > > > > > > > level, > > > > > > > > >> do > > > > > > > > >> > we > > > > > > > > >> > > also want to support it at the API level or do we > think > > > the > > > > > > > > additional > > > > > > > > >> > > complexity is not worth it? > > > > > > > > >> > > > > > > > > > > >> > > 2. Is `offsetForTime` the right method name given that > > we > > > > are > > > > > > > > getting > > > > > > > > >> > > multiple offsets? Maybe it should be `offsetsForTimes` > > or > > > > > > > something > > > > > > > > >> like > > > > > > > > >> > > that. > > > > > > > > >> > > > > > > > > > > >> > > Ismael > > > > > > > > >> > > > > > > > > > > >> > > On Wed, Aug 31, 2016 at 4:38 AM, Becket Qin < > > > > > > becket....@gmail.com > > > > > > > > > > > > > > > > >> > wrote: > > > > > > > > >> > > > > > > > > > > >> > > > Hi Kafka devs, > > > > > > > > >> > > > > > > > > > > > >> > > > I created KIP-79 to allow consumer to precisely > query > > > the > > > > > > > offsets > > > > > > > > >> based > > > > > > > > >> > > on > > > > > > > > >> > > > timestamp. > > > > > > > > >> > > > > > > > > > > > >> > > > In short we propose to : > > > > > > > > >> > > > 1. add a ListOffsetRequest/ListOffsetResponse v1, > and > > > > > > > > >> > > > 2. add an offsetForTime() method in new consumer. > > > > > > > > >> > > > > > > > > > > > >> > > > The KIP wiki is the following: > > > > > > > > >> > > > https://cwiki.apache.org/confluence/pages/viewpage. > > > > > > > > >> > > action?pageId=65868090 > > > > > > > > >> > > > > > > > > > > > >> > > > Comments are welcome. > > > > > > > > >> > > > > > > > > > > > >> > > > Thanks, > > > > > > > > >> > > > > > > > > > > > >> > > > Jiangjie (Becket) Qin > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >