Whoops, I guess I didn't finish reading the KIP all the way to the end earlier. Thanks for including the link to the RocksDB PR in the KIP!
I have one additional question about the proposal: do you plan to also add
this prefix seek API to the dual column family iterators? These are used by
RocksDBTimestampedStore (which extends RocksDBStore), for example the
*RocksDBDualCFRangeIterator*. Thanks for the KIP!

On Thu, May 14, 2020 at 10:50 AM Sagar <sagarmeansoc...@gmail.com> wrote:

> Hey @Adam,
>
> Thanks for sharing your experience with using prefix seek. I did look at
> your code for RocksDBPrefixIterator; in fact, I have repurposed that class
> itself since it wasn't being used elsewhere. Regarding how I plan to
> expose this throughout the state stores: I have added it as a separate
> interface, so it is not at the same level as the *range* function, so to
> speak. The reason is that currently I feel not all state stores are a
> natural fit for prefix seek. As I mentioned in the KIP, the closest
> current equivalent is BulkLoadingStore (not in terms of functionality,
> but in terms of how it is also not implemented by all of the stores).
> That way I don't need to stub it out across all the state stores, and we
> can implement it only where needed. For example, in the PR that I have
> put up for reference in the KIP, you can see that I have implemented it
> only for RocksDB.
>
> @Guozhang,
>
> Thanks for the feedback. Those are very interesting questions, and I will
> try my best to answer based upon whatever limited understanding I have
> developed so far :)
>
> 1) Regarding the usage of useFixedLengthPrefixExtractor: honestly, I
> hadn't looked at that config. I did look it up after you pointed it out,
> and it seems it's more for hash-based memtables? I may be wrong though.
> But I would say that the changes I made were not so much about
> correctness as about showcasing how we can implement these changes.
> The idea was that once we see the merit in this approach, we can add some
> of the tunings (and I would need your team's assistance there :D).
>
> 2) Regarding the similarity of `RocksDBPrefixIterator` and
> `RocksDBRangeIterator`: yes, the implementations look more or less
> similar, so in terms of performance they might be comparable. But
> semantically, they solve two different use cases. The range seek is
> useful when we know both from and to. If we consider use cases where we
> want to find keys with a certain prefix, but we don't know what its start
> and end are, then prefix seek comes in more handy. The point I am trying
> to make is that it can extend the scope of state stores from plain point
> lookups to somewhat speculative queries, whereby users can search whether
> a certain pattern exists. I can vouch for this personally because I
> wanted to use state stores for one such use case, and since this option
> wasn't there, I had to do some other things. A rough equivalent is the
> SCAN operator in Redis (not trying to compare Redis and state stores,
> just trying to give some context).
>
> Regarding the point on bloom filters, there are certain optimisations
> being discussed for the prefix seek case here:
>
> https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#prefix-vs-whole-key
>
> Again, this isn't something that I have explored fully. Also, the prefix
> seek page in the RocksDB wiki mentions a prefix iterating technique
> called the Prefix Bloom Filter.
>
> 3) Regarding the question on length of bytes for seek vs. prefix seek, I
> am not entirely sure about that scenario. What I have understood is that,
> at least for RocksDB, it is more performant for short iterator queries
> than for longer ones.
>
> 4) Regarding the last question on placing it within Segment: the reason I
> didn't do it that way is that I thought we shouldn't tie this feature
> only to RocksDB.
> I agree that I got this idea while looking at/reading about RocksDB, but
> if we keep it outside the purview of RocksDB and keep it as a pluggable
> entity, then a) it remains generic by not being tied to any specific
> store, and b) no change is needed at all for any of the other stores
> which haven't implemented it.
>
> I am not sure if any of the above points make sense, but as I said, this
> is based on my limited understanding of the codebase. So, pardon any
> incorrect/illogical statements plz!
>
> @Sophie,
>
> Thanks for bringing that point up! I have mentioned that PR in the KIP
> under a section called Other considerations. Nonetheless, thanks for
> pointing it out!
>
> Thanks!
> Sagar.
>
>
> On Thu, May 14, 2020 at 5:17 AM Sophie Blee-Goldman <sop...@confluent.io>
> wrote:
>
> > Not to derail this KIP discussion, but to leave a few notes on some of
> > the RocksDB points that have come up:
> >
> > Someone actually merged some long overdue performance improvements to
> > the RocksJava implementation (the PR was opened back in 2017! yikes).
> > I haven't looked into the prefix seek API closely enough to know how
> > relevant this particular change is, and they are still improving
> > things, but it gives me some faith.
> >
> > There are some pretty promising results reported on the PR:
> > https://github.com/facebook/rocksdb/pull/2283#issuecomment-561563037
> >
> > Regarding the custom comparator, they also recently merged this
> > performance improvement <https://github.com/facebook/rocksdb/pull/6252>.
> > The tl;dr is that they reduced the slowdown of a custom comparator in
> > Java (relative to the native C++) from ~7x to ~5.2x at best. That is
> > still not great, but it would be interesting to run our own benchmarks
> > and see how this stacks up.
> > Of course, these are all new changes, and as such they will require us
> > to upgrade Rocks to 6.x, which means they have to wait for us to
> > release a 3.0. But there's some talk about 3.0 coming in the next few
> > releases, so consider it food for not-so-future thought.
> >
> >
> > On Tue, May 12, 2020 at 5:02 PM Adam Bellemare <adam.bellem...@gmail.com>
> > wrote:
> >
> > > Hi Guozhang
> > >
> > > For clarity, the issues I was running into were not about the actual
> > > *prefixSeek* function itself, but about exposing it at the same level
> > > of access as the *range* function throughout Kafka Streams. It
> > > required a lot of changes, and also required that most state stores
> > > stub it out since it wasn't clear how they would implement it. It was
> > > basically an overreaching API change that was easily solved (for the
> > > specific prefix scan in FKJ) by simply using *range*. So to be clear,
> > > the blockers were predominantly around correctly handling the API
> > > changes, nothing to do with the mechanics of the RocksDB prefix
> > > scanning.
> > >
> > > As for KAFKA-5285, I'll look into it more to see if I can get a
> > > better handle on the problem!
> > >
> > > Hope this helps clear it up.
> > >
> > > Adam
> > >
> > >
> > > On Tue, May 12, 2020 at 7:16 PM Guozhang Wang <wangg...@gmail.com>
> > > wrote:
> > >
> > > > Hello Adam,
> > > >
> > > > I'm wondering if you can provide a bit more context on the blockers
> > > > to using prefixSeek with RocksDB (I saw you have a
> > > > RocksDBPrefixIterator class, but it is not used anywhere yet)? I'm
> > > > currently looking at ways to allow some secondary indices with
> > > > RocksDB, following some existing approaches from CockroachDB etc.,
> > > > so I'm very curious to learn from your experience.
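As background for the exchange in this thread: a prefix seek simply positions an iterator at the first key at or after the prefix, then scans forward only while keys still carry that prefix. Here is a minimal pure-Java sketch of those semantics, with a sorted map standing in for RocksDB (all names here are mine for illustration, not from the KIP or Kafka Streams):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class PrefixSeekSketch {
    // Prefix seek, illustratively: position at the first key >= prefix,
    // then scan forward only while keys still start with the prefix.
    static List<String> scanPrefix(NavigableMap<String, String> store, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String key : store.tailMap(prefix, true).navigableKeySet()) {
            if (!key.startsWith(prefix)) {
                break; // left the prefix region; no need to scan further
            }
            hits.add(key);
        }
        return hits;
    }

    public static void main(String[] args) {
        // The map's natural String ordering plays the role of RocksDB's
        // default lexicographic byte comparator.
        TreeMap<String, String> store = new TreeMap<>();
        store.put("user_1|click", "a");
        store.put("user_1|view", "b");
        store.put("user_2|click", "c");
        store.put("vendor_1|sale", "d");
        System.out.println(scanPrefix(store, "user_1")); // [user_1|click, user_1|view]
    }
}
```

This is the "seek then scan" shape that both `RocksDBRangeIterator` and a prefix iterator share; the difference discussed above is only in how the stopping condition is derived.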
> > > > 1) Before considering any secondary indices, a quick thought is
> > > > that for (key, timeFrom, timeTo) queries, we can easily replace the
> > > > current `range()` impl with a `prefixRange()` impl via a prefix
> > > > iterator; though for (keyFrom, keyTo, timeFrom, timeTo) it is much
> > > > more complicated indeed, and hence the existing `range()` impl may
> > > > still be used.
> > > >
> > > > 2) Another related issue I've been pondering for a while is around
> > > > KAFKA-5285: with the default lexicographic byte comparator, since
> > > > the key length varies, the combo (key, window) would have
> > > > interleaving byte layouts like:
> > > >
> > > > AAA0001 (key AAA, timestamp 0001)
> > > > AAA00011 (key AAA0, timestamp 0011)
> > > > AAA0002 (key AAA, timestamp 0002)
> > > >
> > > > which is challenging for prefix seeks to work efficiently. Although
> > > > we can overwrite the byte comparator in JNI, that is very expensive
> > > > and the cost of JNI overwhelms its benefits. If you've got some
> > > > ideas around it, please lmk as well.
> > > >
> > > > Guozhang
> > > >
> > > >
> > > > On Tue, May 12, 2020 at 6:26 AM Adam Bellemare
> > > > <adam.bellem...@gmail.com> wrote:
> > > >
> > > > > Hi Sagar
> > > > >
> > > > > I implemented a very similar interface for KIP-213, the
> > > > > foreign-key joiner. We pulled it out of the final implementation
> > > > > and instead used RocksDB range. You can see the particular code
> > > > > where we use RocksDB.range(...) to get the same iterator result.
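Guozhang's interleaving layouts above can be reproduced directly: sorting the three encoded entries under a plain lexicographic comparator puts the entry for key "AAA0" between the two entries for key "AAA", which is exactly why a prefix seek on "AAA" cannot cleanly isolate a single key. A small pure-Java check (the `encode` helper is mine, mirroring the variable-length-key-plus-timestamp layout described in KAFKA-5285):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InterleavingSketch {
    // Naive layout from the thread: variable-length key immediately
    // followed by a fixed-width timestamp, compared lexicographically.
    static String encode(String key, String timestamp) {
        return key + timestamp;
    }

    public static void main(String[] args) {
        List<String> entries = new ArrayList<>(Arrays.asList(
                encode("AAA", "0001"),
                encode("AAA0", "0011"),
                encode("AAA", "0002")));
        entries.sort(null); // natural order, like RocksDB's default comparator

        // The key-"AAA0" entry lands between the two key-"AAA" entries:
        System.out.println(entries); // [AAA0001, AAA00011, AAA0002]

        // Worse, every encoding of key "AAA0" also matches prefix "AAA",
        // so a prefix seek on "AAA" picks up the wrong key's windows.
        System.out.println("AAA00011".startsWith("AAA")); // true
    }
}
```

Fixed-length keys (or a length-framed key encoding) would remove the interleaving, which is presumably why `useFixedLengthPrefixExtractor`-style configs came up earlier in the thread.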
> > > > > https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/foreignkeyjoin/ForeignJoinSubscriptionProcessorSupplier.java#L95
> > > > >
> > > > > We pulled it out because there were numerous awkward acrobatics
> > > > > needed to integrate the *prefixSeek()* function into the Kafka
> > > > > Streams code. Basically, I wanted to be able to access
> > > > > *prefixSeek()* the same way I can access *range()* for any state
> > > > > store, and in particular use it for storing data with a
> > > > > particular foreign key (as per the previous URL). However, I
> > > > > found out that it required way too many changes to expose the
> > > > > *prefixSeek()* functionality while still being able to leverage
> > > > > all the nice Kafka Streams state management + supplier
> > > > > functionality, so we made the decision just to stick with
> > > > > *range()* and pull everything else out.
> > > > >
> > > > > I guess my question here is: how do you anticipate using
> > > > > *prefixSeek()* within the framework of Kafka Streams, or the
> > > > > Processor API?
> > > > >
> > > > > Adam
> > > > >
> > > > >
> > > > > On Tue, May 12, 2020 at 2:52 AM Sagar <sagarmeansoc...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > I would like to start a discussion on the KIP that I created
> > > > > > below to add prefix scan support in State Stores:
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-614%3A+Add+Prefix+Scan+support+for+State+Stores
> > > > > >
> > > > > > Thanks!
> > > > > > Sagar.
> > > > > >
> > > >
> > > > --
> > > > -- Guozhang
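For anyone wondering how the *range()*-based workaround Adam describes can cover exactly one prefix: the usual trick is to derive an exclusive upper bound by incrementing the last non-0xFF byte of the prefix and truncating there. A rough sketch of that bound computation (my own names; this is the general technique, not the actual KIP-213 code):

```java
import java.util.Arrays;

public class PrefixToRangeSketch {
    // Smallest byte string strictly greater than every string that starts
    // with `prefix`: increment the last byte that is not 0xFF, then
    // truncate after it. Returns null when no bound exists (the prefix is
    // all 0xFF bytes), meaning the scan must run to the end of the store.
    static byte[] upperBoundFor(byte[] prefix) {
        byte[] bound = Arrays.copyOf(prefix, prefix.length);
        for (int i = bound.length - 1; i >= 0; i--) {
            if (bound[i] != (byte) 0xFF) {
                bound[i]++;
                return Arrays.copyOf(bound, i + 1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] prefix = {'f', 'o', 'o'};
        byte[] to = upperBoundFor(prefix);
        System.out.println(new String(to)); // fop
        // range(from = "foo", to = "fop"), with `to` treated as exclusive,
        // then returns exactly the keys carrying the prefix "foo".
    }
}
```

Whether the store's range bound is inclusive or exclusive matters here; an inclusive upper bound would need a final check that the returned key still starts with the prefix.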